ETAPS Foreword


Welcome to the 26th ETAPS! ETAPS 2023 took place in Paris, the beautiful capital of 
France. ETAPS 2023 was the 26th instance of the European Joint Conferences on 
Theory and Practice of Software. ETAPS is an annual federated conference established 
in 1998, and consists of four conferences: ESOP, FASE, FoSSaCS, and TACAS. Each 
conference has its own Program Committee (PC) and its own Steering Committee 
(SC). The conferences cover various aspects of software systems, ranging from theo- 
retical computer science to foundations of programming languages, analysis tools, and 
formal approaches to software engineering. Organising these conferences in a coherent, 
highly synchronized conference programme enables researchers to participate in an 
exciting event, having the possibility to meet many colleagues working in different 
directions in the field, and to easily attend talks of different conferences. On the 
weekend before the main conference, numerous satellite workshops took place that 
attracted many researchers from all over the globe. 

ETAPS 2023 received 361 submissions in total, 124 of which were accepted, 
yielding an overall acceptance rate of 34.3%. I thank all the authors for their interest in 
ETAPS, all the reviewers for their reviewing efforts, the PC members for their con- 
tributions, and in particular the PC (co-)chairs for their hard work in running this entire 
intensive process. Last but not least, my congratulations to all authors of the accepted 
papers! 

ETAPS 2023 featured the unifying invited speakers Véronique Cortier (CNRS, 
LORIA laboratory, France) and Thomas A. Henzinger (Institute of Science and 
Technology, Austria) and the conference-specific invited speakers Mooly Sagiv (Tel 
Aviv University, Israel) for ESOP and Sven Apel (Saarland University, Germany) for 
FASE. Invited tutorials were provided by Ana-Lucia Varbanescu (University of 
Twente and University of Amsterdam, The Netherlands) on heterogeneous computing 
and Joost-Pieter Katoen (RWTH Aachen, Germany and University of Twente, The 
Netherlands) on probabilistic programming. 

As part of the programme we had the second edition of TOOLympics, an event to 
celebrate the achievements of the various competitions or comparative evaluations in 
the field of ETAPS. 

ETAPS 2023 was organized jointly by Sorbonne Université and Université 
Sorbonne Paris Nord. Sorbonne Université (SU) is a multidisciplinary,
research-intensive and world-class academic institution. It was created in 2018 as the
merger of two first-class research-intensive universities, UPMC (Université Pierre et
Marie Curie) and Paris-Sorbonne. SU has three faculties (humanities, medicine, and
science and engineering), 55,600 students (4,700 PhD students; 10,200 international
students), 6,400 teachers and professor-researchers, and 3,600 administrative and
technical staff members. Université
Sorbonne Paris Nord is one of the thirteen universities that succeeded the University of 
Paris in 1968. It is a major teaching and research center located in the north of Paris. It 
has five campuses, spread over the two departments of Seine-Saint-Denis and Val 
d’Oise: Villetaneuse, Bobigny, Saint-Denis, the Plaine Saint-Denis and Argenteuil. The
university has more than 25,000 students in different fields, such as health, medicine, 
languages, humanities, and science. The local organization team consisted of Fabrice 
Kordon (general co-chair), Laure Petrucci (general co-chair), Benedikt Bollig (work- 
shops), Stefan Haar (workshops), Etienne André (proceedings and tutorials), Céline 
Ghibaudo (sponsoring), Denis Poitrenaud (web), Stefan Schwoon (web), Benoit Barbot 
(publicity), Nathalie Sznajder (publicity), Anne-Marie Reytier (communication), 
Hélène Pétridis (finance) and Véronique Criart (finance). 

ETAPS 2023 is further supported by the following associations and societies: 
ETAPS e.V., EATCS (European Association for Theoretical Computer Science), 
EAPLS (European Association for Programming Languages and Systems), EASST 
(European Association of Software Science and Technology), LIP6 (Laboratoire
d'Informatique de Paris 6), LIPN (Laboratoire d'informatique de Paris Nord), Sorbonne
Université, Université Sorbonne Paris Nord, CNRS (Centre national de la recherche
scientifique), CEA (Commissariat à l'énergie atomique et aux énergies alternatives),
LMF (Laboratoire méthodes formelles), and Inria (Institut national de recherche en 
informatique et en automatique). 

The ETAPS Steering Committee consists of an Executive Board, and representa- 
tives of the individual ETAPS conferences, as well as representatives of EATCS, 
EAPLS, and EASST. The Executive Board consists of Holger Hermanns
(Saarbrücken), Marieke Huisman (Twente, chair), Jan Kofroň (Prague), Barbara König
(Duisburg), Thomas Noll (Aachen), Caterina Urban (Inria), Jan Křetínský (Munich), 
and Lenore Zuck (Chicago). 

Other members of the steering committee are: Dirk Beyer (Munich), Luis Caires 
(Lisboa), Ana Cavalcanti (York), Bernd Finkbeiner (Saarland), Reiko Heckel 
(Leicester), Joost-Pieter Katoen (Aachen and Twente), Naoki Kobayashi (Tokyo), 
Fabrice Kordon (Paris), Laura Kovacs (Vienna), Orna Kupferman (Jerusalem), Leen 
Lambers (Cottbus), Tiziana Margaria (Limerick), Andrzej Murawski (Oxford), Laure 
Petrucci (Paris), Elizabeth Polgreen (Edinburgh), Peter Ryan (Luxembourg), Sriram 
Sankaranarayanan (Boulder), Don Sannella (Edinburgh), Natasha Sharygina (Lugano), 
Pawel Sobocinski (Tallinn), Sebastian Uchitel (London and Buenos Aires), Andrzej 
Wasowski (Copenhagen), Stephanie Weirich (Pennsylvania), Thomas Wies (New 
York), Anton Wijs (Eindhoven), and James Worrell (Oxford). 

I would like to take this opportunity to thank all authors, keynote speakers, atten- 
dees, organizers of the satellite workshops, and Springer-Verlag GmbH for their 
support. I hope you all enjoyed ETAPS 2023. 

Finally, a big thanks to Laure and Fabrice and their local organization team for all 
their enormous efforts to make ETAPS a fantastic event. 


April 2023 Marieke Huisman 
ETAPS SC Chair 
ETAPS e.V. President 


Preface 


We are pleased to present the proceedings of TACAS 2023, the 29th edition of the 
International Conference on Tools and Algorithms for the Construction and Analysis of 
Systems held as part of the 26th European Joint Conferences on Theory and Practice of 
Software (ETAPS 2023), April 24–28, 2023 in Paris, France. TACAS brings together a
community of researchers, developers, and end-users who are broadly interested in 
rigorous algorithmic techniques for the construction and analysis of systems. The 
conference is a venue that interleaves various disciplines including formal verification 
of software and hardware systems, static analysis, program synthesis, verification of 
machine learning/autonomous systems, probabilistic programming, SAT/SMT solving, 
constraint solving, automated theorem proving and Cyber-Physical
Systems. 
There were five submission categories for TACAS 2023: 


1. Regular research papers advancing the theoretical foundations for the
construction and analysis of systems.

2. Case study papers describing the application of state-of-the-art research techniques 
on real-world applications. 

3. Regular tool papers presenting a new tool, a new tool component, or novel 
extensions to an existing tool of interest to the community. 

4. Tool demonstration papers focusing on the usage aspects of tools. 

5. SV-COMP competition tool papers organized as a separate conference track. 


Regular research, case study, and regular tool papers were restricted to a total of 
sixteen pages, and tool demonstration papers to six pages, exclusive of references. 

This year 169 papers were submitted to TACAS, consisting of 119 regular research 
papers, 34 regular tool and case study papers, and 16 tool demonstration papers. Each 
paper was reviewed by three Program Committee (PC) members, who made use of sub- 
reviewers. As a result, the PC accepted in total 62 papers, among which there were 45 
regular papers, 11 regular tool/case-study papers and 6 tool demonstration papers. The 
PC members were pleasantly surprised by an unusually large number of strong sub- 
missions. Almost all accepted papers had either all positive reviews or a “championing” 
program committee member who argued in favor of accepting the paper. Furthermore, 
all accepted papers had a positive average score. One paper was accepted conditionally 
and successfully “shepherded” by the PC. 

Similarly to previous years, it was possible to submit an artifact alongside a paper, 
which was mandatory for regular tool and tool demonstration papers. An artifact might 
consist of tools, models, proofs, or other data required for validation of the results 
of the paper. The Artifact Evaluation Committee (AEC) reviewed the artifacts based on
their documentation, ease of use, and, most importantly, whether the results presented 
in the corresponding paper could be accurately reproduced. The evaluation was carried 
out using a standardized virtual machine to ensure consistency of the results, except for 
4 artifacts that had special hardware or software requirements. The evaluation had two 
rounds. The first round was carried out in parallel with the work of the PC and 
evaluated the artifacts for all the submitted regular tool and tool demo papers. The 
judgment of the AEC was communicated to the PC and weighed in their discussion 
(the PC rejected a total of 4 papers in this phase). The second round took place after the 
paper acceptance notifications were sent out so the authors of accepted research and 
case-study papers could submit their artifacts. In both rounds, the AEC provided 3 
reviews per artifact and communicated with the authors to resolve apparent technical 
issues. In total, 69 artifacts were submitted (51 in the first round and 18 in the second), 
and the AEC evaluated a total of 64 artifacts regarding their availability, functionality, 
and/or reusability. Finally, among the 62 accepted papers, the AEC awarded 32 
functional badges, 21 reusable badges, and 33 available badges. Such badges appear on 
the first page of each paper to certify the properties of each artifact. 

As a separate conference track, TACAS 2023 hosted the 12th Competition on 
Software Verification (SV-COMP 2023). SV-COMP is the annual comparative eval- 
uation of tools for automatic software verification and witness validation. The TACAS 
proceedings contain a selection of 13 short papers that describe participating verifi- 
cation systems and a report presenting the results of the competition. These papers were 
reviewed by a separate program committee (the competition jury); each of the papers
was assessed by at least three reviewers. A total of 52 verification systems were 
systematically evaluated, with 34 developer teams from ten countries, including five 
submissions from industry. Two sessions in the TACAS program were reserved for the 
competition: presentations by the competition chair and the participating development 
teams in the first session and an open community meeting in the second session. 

We would like to thank all the people who helped to make TACAS 2023 successful. 
First, we would like to thank the authors for submitting their papers to TACAS 2023. 
The PC members and additional reviewers did a great job in reviewing papers: they 
contributed informed and detailed reports and engaged in the PC discussions. We also 
thank the steering committee, and especially its chair, Joost-Pieter Katoen, for his 
valuable advice. Lastly, we would like to thank the overall organization team of 
ETAPS 2023. 
Abstract. Reinforcement learning has received much attention for learning
controllers of deterministic systems. We consider a learner-verifier
framework for stochastic control systems and survey recent methods that 
formally guarantee a conjunction of reachability and safety properties. 
Given a property and a lower bound on the probability of the property 
being satisfied, our framework jointly learns a control policy and a for- 
mal certificate to ensure the satisfaction of the property with a desired 
probability threshold. Both the control policy and the formal certifi- 
cate are continuous functions from states to reals, which are learned as 
parameterized neural networks. While in the deterministic case, the cer- 
tificates are invariant and barrier functions for safety, or Lyapunov and 
ranking functions for liveness, in the stochastic case the certificates are 
supermartingales. For certificate verification, we use interval arithmetic 
abstract interpretation to bound the expected values of neural network 
functions. 


Keywords: Learning-based control · Stochastic systems · Martingales · Formal verification


1 Introduction 


Learning-based control and verification of learned controllers. Learning-based 
control and reinforcement learning (RL) were empirically demonstrated to have 
enormous potential to solve highly non-linear control tasks. However, their de- 
ployment in safety-critical scenarios such as autonomous driving or healthcare 
requires safety assurances. Most safety-aware RL algorithms optimize expected 
reward while only empirically trying to maximize safety probability. This to- 
gether with the non-explainable nature of neural network controllers obtained 
via deep RL raises questions about the trustworthiness of learning-based methods
for safety-critical applications [9,27]. To that end, formal verification of learned 
controllers as well as learning-based control with formal safety guarantees have
become very active research topics. 


Learning certificate functions. A classical approach to formally proving proper- 
ties of dynamical systems is to compute a certificate function. A certificate func- 
tion [26] is a function that assigns real values to system states and its defining 
conditions imply satisfaction of the property. Thus, in order to prove the prop- 
erty of interest, it suffices to compute a certificate function for that property. 
For instance, Lyapunov functions [46] and barrier functions [50] are standard 
certificate functions for proving reachability of some target set and avoidance of 
some unsafe set of system states, respectively, when the system dynamics are 
deterministic. While both Lyapunov and barrier functions are well-studied con- 
cepts in dynamical systems theory, early methods for their computation either 
required designing the certificates by hand or using computationally intractable 
numerical procedures. A more recent approach reduces certificate computation 
to a semi-definite programming problem by using sum-of-squares (SOS) tech- 
niques [33,49,37]. However, a limitation of this approach is that it is only appli- 
cable to polynomial systems and computation of polynomial certificate functions, 
whereas it is not applicable to systems with general non-linearities. Moreover, 
SOS methods do not scale well with the dimension of the system. 

Learning-based methods are a promising approach to overcome these limi- 
tations and they have received much attention in recent years. These methods 
jointly learn a neural network control policy and a neural network certificate 
function, e.g. a Lyapunov function [53,18,3,17] or a barrier function [38,58,52,1],
depending on the property of interest. The neural network certificate is then 
formally verified, ensuring that these methods provide formal guarantees. Both 
learning and verification procedures developed for verifying neural network cer- 
tificates are not restricted to polynomial dynamical systems. See [26] for an 
overview of existing learning-based control methods that learn a certificate func- 
tion to verify a system property in deterministic dynamical systems. 


Prior works — deterministic dynamical systems. While the above works present 
significant advancements in learning-based control and verification of dynamical 
systems, they are predominantly restricted to deterministic dynamical systems. 
In other words, they assume that they have access to the exact dynamics function 
according to which the system evolves. However, for most control tasks, the 
underlying models used by control methods are imperfect approximations of 
real systems inferred from observed data. Thus, control and verification methods 
should also account for model uncertainty due to the noise in observed data and 
the approximate nature of model inference. 


This survey — stochastic dynamical systems. In this work, we survey recent devel- 
opments in learning-based methods for control and verification of discrete-time 
stochastic dynamical systems, based on [44,68]. Stochastic dynamical systems 
use probability distributions to quantify and model uncertainty. In stochastic 
dynamical systems, given a property of interest and a probability parameter 
$p \in [0,1]$, the goal is to learn a control policy and a formal certificate which
guarantees that the system under the learned policy satisfies the property of 
interest with probability at least p. 
Supermartingale certificate functions. Lyapunov functions and barrier functions
can be used to prove properties in deterministic dynamical systems; however,
they are not applicable to stochastic dynamical systems and do not allow rea- 
soning about the probability of a property being satisfied. Instead, the learning- 
based methods of [44,68] use supermartingale certificate functions to formally 
prove properties in stochastic systems. Supermartingales are a class of stochas- 
tic processes that decrease in expected value at every time step [66]. Their nice 
convergence properties and concentration bounds allow their use in designing 
certificate functions for stochastic dynamical systems. In particular, ranking
supermartingales (RSMs) [15,44] were used to verify probability 1 reachability and
stochastic barrier functions (SBFs) [50] were used to verify safety with the specified
probability $p \in [0,1]$. Reach-avoid supermartingales (RASMs) [68] unify and
extend these two concepts and were used to verify reach-avoidance properties
with the specified probability $p \in [0,1]$, i.e. a conjunction of reachability and
safety properties. We define and compare these concepts in Section 3. 


Fig. 1: Schematic illustration of the learner-verifier loop. The learner passes a
certificate candidate to the verifier, and the verifier returns a counterexample set
to the learner.


Learner-verifier framework for stochastic dynamical systems. In Section 4, we 
then present a learner-verifier framework of [44,68] for learning-based control 
and for the verification of learned controllers in stochastic dynamical systems 
in a counterexample guided inductive synthesis (CEGIS) fashion [55]. The al- 
gorithm jointly learns a neural network control policy and a neural network 
supermartingale certificate function. It consists of two modules — the learner, 
which learns a policy and a supermartingale certificate function candidate, and 
the verifier, which then formally verifies the candidate supermartingale certifi- 
cate function. If the verification step fails, the verifier computes counterexamples 
and passes them back to the learner, which tries to learn a new candidate. This 
loop is repeated until a candidate is successfully verified, see Fig. 1. 
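
The loop described above can be sketched in a few lines of Python. This is only a schematic illustration under assumptions of our own: the names `learner_verifier_loop`, `learn` and `verify`, and the iteration budget, are hypothetical and do not come from [44,68]. The learner is any procedure that fits a candidate to a set of samples, and the verifier is any procedure that returns a (possibly empty) list of counterexamples.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

Candidate = TypeVar("Candidate")
Sample = TypeVar("Sample")


def learner_verifier_loop(
    learn: Callable[[List[Sample]], Candidate],
    verify: Callable[[Candidate], List[Sample]],
    initial_samples: Iterable[Sample],
    max_iterations: int = 100,
) -> Optional[Candidate]:
    """Counterexample-guided loop with a hypothetical learner/verifier interface.

    learn(samples) returns a candidate fitted to the samples.
    verify(candidate) returns counterexamples; an empty list means "verified".
    """
    samples = list(initial_samples)
    for _ in range(max_iterations):
        candidate = learn(samples)
        counterexamples = verify(candidate)
        if not counterexamples:
            return candidate  # the verifier accepted the candidate
        samples.extend(counterexamples)  # pass counterexamples back to the learner
    return None  # no verified candidate within the iteration budget


# Toy stand-ins (assumptions for illustration only): the "candidate" is a
# number, and the "verifier" accepts it once it is at least 5.
result = learner_verifier_loop(
    learn=lambda samples: max(samples),
    verify=lambda c: [] if c >= 5 else [c + 1],
    initial_samples=[0],
)
```

In the actual framework the candidate would be a pair of neural networks (policy and certificate) and the verifier would be sound; the sketch only mirrors the control flow of Fig. 1.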

This framework builds on the existing learner-verifier methods for learning- 
based control in deterministic dynamical systems [18,2,26]. However, the ex- 
tension of this framework to stochastic dynamical systems and the synthesis 
of supermartingale certificate functions is far from straightforward. In particu- 
lar, the methods of [18,2] use knowledge of the deterministic dynamics function 
to reduce the verification task to a decision procedure and use an off-the-shelf 
solver. However, verification of the expected decrease condition of supermartin- 
gale certificates by reduction to a decision procedure would require being able
to compute a closed-form expression of the expected value of a neural network 
function over a probability distribution and provide it to the solver. It is not clear 
how the closed-form expression can be computed, and it is not known whether 
the closed-form expression exists in the general case. 

This challenge is solved by using a method for efficient computation of tight 
upper and lower bounds on the expected value of a neural network function. The 
verifier module then verifies the expected decrease condition by discretizing the 
state space and formally verifying a slightly stricter condition at the discretiza- 
tion points by using the computed expected value bounds. By carefully choosing 
the mesh of the discretization and adding an additional error term, we obtain 
a sound verification method applicable to general Lipschitz continuous systems. 
The expected value bound computation for neural network functions relies on 
interval arithmetic and abstract interpretation, and since it is of independent 
interest, we discuss it in detail in Section 5. We are not aware of any existing 
methods that tackle this problem. 


Extension to general stochastic certificates. We conclude this survey with a dis- 
cussion of possible extensions of the learner-verifier framework in Section 6 and 
of related work in Section 7. 


2 Preliminaries 


We consider discrete-time stochastic dynamical systems defined via

$$x_{t+1} = f(x_t, u_t, \omega_t), \qquad x_0 \in \mathcal{X}_0.$$

The function $f : \mathcal{X} \times \mathcal{U} \times \mathcal{N} \to \mathcal{X}$ is the dynamics function of the system
and $t \in \mathbb{N}_0$ is the time index. We use $\mathcal{X} \subseteq \mathbb{R}^m$ to denote the system state
space, $\mathcal{U} \subseteq \mathbb{R}^n$ the control action space and $\mathcal{N} \subseteq \mathbb{R}^p$ the stochastic disturbance
space. For each $t \in \mathbb{N}_0$, $x_t \in \mathcal{X}$ is the state of the system, $u_t \in \mathcal{U}$ the action and
$\omega_t \in \mathcal{N}$ the stochastic disturbance vector at time $t$. The set $\mathcal{X}_0 \subseteq \mathcal{X}$ is the set
of initial states. In each time step, $u_t$ is chosen according to a control policy
$\pi : \mathcal{X} \to \mathcal{U}$, i.e. $u_t = \pi(x_t)$, and $\omega_t$ is sampled according to some specified
probability distribution $d$ over $\mathbb{R}^p$. The dynamics function $f$, control policy $\pi$
and probability distribution $d$ together define a stochastic feedback loop system.
A trajectory of the system is a sequence $(x_t, u_t, \omega_t)_{t \in \mathbb{N}_0}$ such that, for each
$t \in \mathbb{N}_0$, we have $u_t = \pi(x_t)$, $\omega_t \in \mathrm{support}(d)$ and $x_{t+1} = f(x_t, u_t, \omega_t)$. For each
initial state $x_0 \in \mathcal{X}$, the system induces a Markov process. This gives rise to the
probability space over the set of all trajectories of the system that start in $x_0$ [51].
We denote the probability measure and the expectation in this probability space
by $\mathbb{P}_{x_0}$ and $\mathbb{E}_{x_0}$, respectively.

Assumptions. We assume that $\mathcal{X} \subseteq \mathbb{R}^m$, $\mathcal{X}_0 \subseteq \mathbb{R}^m$, $\mathcal{U} \subseteq \mathbb{R}^n$ and $\mathcal{N} \subseteq \mathbb{R}^p$ are all
Borel-measurable. This is necessary for the probability space of the set of all
system trajectories starting in some initial state to be mathematically well-defined.
We also assume that $\mathcal{X} \subseteq \mathbb{R}^m$ is compact (i.e. closed and bounded) and that the
dynamics function $f$ is Lipschitz continuous, which are common assumptions in
control theory. Finally, we assume that the probability distribution $d$ is a
product of independent univariate probability distributions, which is necessary for
efficient sampling and expected value computation.
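
To make the feedback loop concrete, the following minimal Python sketch simulates a finite prefix of a trajectory $(x_t, u_t, \omega_t)$. The function `simulate` and the one-dimensional toy system (`toy_dynamics`, `toy_policy`, uniform noise on $[-0.1, 0.1]$) are assumptions chosen purely for illustration; they are not taken from the surveyed works.

```python
import random
from typing import Callable, List, Tuple

Step = Tuple[float, float, float]  # (state x_t, action u_t, disturbance w_t)


def simulate(
    f: Callable[[float, float, float], float],
    policy: Callable[[float], float],
    sample_noise: Callable[[], float],
    x0: float,
    steps: int,
) -> List[Step]:
    """Return the first `steps` triples (x_t, u_t, w_t) of one trajectory."""
    trajectory: List[Step] = []
    x = x0
    for _ in range(steps):
        u = policy(x)        # u_t = pi(x_t)
        w = sample_noise()   # w_t sampled from the distribution d
        trajectory.append((x, u, w))
        x = f(x, u, w)       # x_{t+1} = f(x_t, u_t, w_t)
    return trajectory


# Hypothetical 1-D system: additive control and additive bounded noise.
def toy_dynamics(x: float, u: float, w: float) -> float:
    return x + u + w


def toy_policy(x: float) -> float:
    return -0.5 * x


rng = random.Random(0)  # fixed seed so the run is repeatable
trajectory = simulate(
    toy_dynamics, toy_policy, lambda: rng.uniform(-0.1, 0.1), x0=1.0, steps=5
)
```

The toy dynamics are linear and hence Lipschitz continuous, and the noise distribution is a single univariate distribution, so the example is consistent with the assumptions stated above.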


2.1 Brief Overview of Martingale Theory 


In this subsection, we provide a brief overview of definitions and results from 
martingale theory that lie at the core of formal reasoning about supermartingale 
certificate functions. We assume that the reader is familiar with the mathemati- 
cal definitions of probability space, measurability and random variables, see [66] 
for the necessary background. The results in this subsection will help in building 
an intuition on supermartingale certificate functions, but omitting them would 
not prevent the reader from following the rest of this paper. 


Probability space. A probability space is a triple $(\Omega, \mathcal{F}, \mathbb{P})$ where $\Omega$ is a state
space, $\mathcal{F}$ is a sigma-algebra and $\mathbb{P}$ is a probability measure which is required to
satisfy the Kolmogorov axioms [66]. A random variable is a function $X : \Omega \to \mathbb{R}$ that
is $\mathcal{F}$-measurable. We use $\mathbb{E}[X]$ to denote the expected value of $X$. A (discrete-time)
stochastic process is a sequence $(X_i)_{i=0}^{\infty}$ of random variables in $(\Omega, \mathcal{F}, \mathbb{P})$.


Conditional expectation. Let $X$ be a random variable in a probability space
$(\Omega, \mathcal{F}, \mathbb{P})$. Given a sub-$\sigma$-algebra $\mathcal{F}' \subseteq \mathcal{F}$, a conditional expectation of $X$ given
$\mathcal{F}'$ is an $\mathcal{F}'$-measurable random variable $Y$ such that, for each $A \in \mathcal{F}'$, we have

$$\mathbb{E}[X \cdot \mathbb{I}(A)] = \mathbb{E}[Y \cdot \mathbb{I}(A)].$$

Here, $\mathbb{I}(A) : \Omega \to \{0,1\}$ is an indicator function of $A$ defined via $\mathbb{I}(A)(\omega) = 1$
if $\omega \in A$, and $\mathbb{I}(A)(\omega) = 0$ if $\omega \notin A$. Intuitively, conditional expectation of $X$
given $\mathcal{F}'$ is an $\mathcal{F}'$-measurable random variable that behaves like $X$ whenever
its expected value is taken over an event in $\mathcal{F}'$. Conditional expectation of a
random variable $X$ given $\mathcal{F}'$ is guaranteed to exist if $X$ is real-valued and
nonnegative [66]. Moreover, for any two conditional expectations $Y$ and $Y'$ of $X$
given $\mathcal{F}'$, we have that $\mathbb{P}[Y = Y'] = 1$. Therefore, the conditional expectation is
almost-surely unique and we may pick one such random variable as a canonical
conditional expectation and denote it by $\mathbb{E}[X \mid \mathcal{F}']$.

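Supermartingales. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \dots \subseteq \mathcal{F}$
be an increasing sequence of sub-$\sigma$-algebras in $\mathcal{F}$ with respect to inclusion. A
nonnegative supermartingale with respect to $(\mathcal{F}_i)_{i=0}^{\infty}$ is a stochastic process $(X_i)_{i=0}^{\infty}$
such that each $X_i$ is $\mathcal{F}_i$-measurable, and $X_i(\omega) \geq 0$ and $\mathbb{E}[X_{i+1} \mid \mathcal{F}_i](\omega) \leq X_i(\omega)$
hold for each $\omega \in \Omega$ and $i \geq 0$. Intuitively, the second condition says that the
expected value of $X_{i+1}$ given the value of $X_i$ must not exceed $X_i$, i.e. the process
does not increase in expectation. This condition is formalized by using conditional
expectation.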

The following two results will be key technical ingredients in our design
of supermartingale certificate functions. The first theorem shows that nonnegative
supermartingales have nice convergence properties and converge almost-surely
to some finite value. The second theorem bounds the probability that the
value of the supermartingale ever exceeds some threshold, and it will allow us to
bound from above the probability of occurrence of some bad event.
Theorem 1 (Supermartingale convergence theorem [66]). Let $(X_i)_{i=0}^{\infty}$
be a nonnegative supermartingale with respect to $(\mathcal{F}_i)_{i=0}^{\infty}$. Then, there exists a
random variable $X_\infty$ in $(\Omega, \mathcal{F}, \mathbb{P})$ to which the supermartingale converges with
probability 1, i.e. $\mathbb{P}[\lim_{i \to \infty} X_i = X_\infty] = 1$.


Theorem 2 ([41]). Let $(X_i)_{i=0}^{\infty}$ be a nonnegative supermartingale with respect
to $(\mathcal{F}_i)_{i=0}^{\infty}$. Then, for every real $\lambda > 0$, we have $\mathbb{P}[\sup_{i \geq 0} X_i \geq \lambda] \leq \mathbb{E}[X_0]/\lambda$.
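
As a small worked instance of Theorem 2 (our own illustration, using the threshold $1/(1-p)$ that appears in the safety conditions of Section 3), take a nonnegative supermartingale with $\mathbb{E}[X_0] \leq 1$ and choose $\lambda = 1/(1-p)$ for some $p \in [0,1)$:

```latex
\mathbb{P}\Big[\sup_{i \geq 0} X_i \geq \tfrac{1}{1-p}\Big]
  \;\leq\; \frac{\mathbb{E}[X_0]}{1/(1-p)}
  \;=\; (1-p)\,\mathbb{E}[X_0]
  \;\leq\; 1-p .
```

For example, $p = 0.9$ gives $\lambda = 10$, so a process that starts with expected value at most 1 ever reaches the level 10 with probability at most 0.1.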


2.2 Problem Statement 


We now formally define the properties and control tasks that we focus on in 
this work. In what follows, let $\mathcal{X}_t, \mathcal{X}_u \subseteq \mathcal{X}$ be disjoint Borel-measurable sets and
$p \in [0,1]$ be a lower bound on the probability with which the system under the
learned controller needs to satisfy the property:


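– Reachability. Let $\mathrm{Reach}(\mathcal{X}_t) = \{(x_t, u_t, \omega_t)_{t \in \mathbb{N}_0} \mid \exists t \in \mathbb{N}_0.\ x_t \in \mathcal{X}_t\}$ be the
set of all trajectories that reach the target set $\mathcal{X}_t$. The goal is to learn a
control policy under which the system reaches $\mathcal{X}_t$ with probability at least
$p$, i.e. $\mathbb{P}_{x_0}[\mathrm{Reach}(\mathcal{X}_t)] \geq p$ holds for every initial state $x_0 \in \mathcal{X}_0$.

– Safety (or avoidance). Let $\mathrm{Safe}(\mathcal{X}_u) = \{(x_t, u_t, \omega_t)_{t \in \mathbb{N}_0} \mid \forall t \in \mathbb{N}_0.\ x_t \notin \mathcal{X}_u\}$
be the set of all trajectories that do not visit the unsafe set $\mathcal{X}_u$. The goal is
to learn a control policy under which the system stays away from $\mathcal{X}_u$ with
probability at least $p$, i.e. $\mathbb{P}_{x_0}[\mathrm{Safe}(\mathcal{X}_u)] \geq p$ holds for every initial state
$x_0 \in \mathcal{X}_0$.

– Reach-avoidance. Let $\mathrm{ReachAvoid}(\mathcal{X}_t, \mathcal{X}_u) = \{(x_t, u_t, \omega_t)_{t \in \mathbb{N}_0} \mid \exists t \in \mathbb{N}_0.\ x_t \in
\mathcal{X}_t \wedge (\forall t' \leq t.\ x_{t'} \notin \mathcal{X}_u)\}$ be the set of all trajectories that reach $\mathcal{X}_t$ without
reaching $\mathcal{X}_u$. The goal is to learn a control policy under which the system
reaches $\mathcal{X}_t$ while staying away from $\mathcal{X}_u$ with probability at least $p$,
i.e. $\mathbb{P}_{x_0}[\mathrm{ReachAvoid}(\mathcal{X}_t, \mathcal{X}_u)] \geq p$ holds for every initial state $x_0 \in \mathcal{X}_0$.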
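
The three properties above are statements about infinite trajectories, but on a finite prefix one can at least decide whether reach-avoidance has already been established or already been violated. The following Python sketch does this for a list of states; `classify_prefix`, `empirical_reach_avoid_rate` and the interval-shaped target and unsafe sets are hypothetical names and choices made for illustration. The resulting frequency is only an empirical estimate and carries none of the formal guarantees discussed in this survey.

```python
from typing import Callable, Iterable, List


def classify_prefix(
    states: Iterable[float],
    in_target: Callable[[float], bool],
    in_unsafe: Callable[[float], bool],
) -> str:
    """Classify a finite state sequence x_0, x_1, ... of one trajectory.

    "reach-avoid": the target was reached and no earlier state was unsafe.
    "unsafe":      an unsafe state was visited before reaching the target.
    "undecided":   neither happened within the prefix.
    """
    for x in states:
        if in_unsafe(x):
            return "unsafe"
        if in_target(x):
            return "reach-avoid"
    return "undecided"


def empirical_reach_avoid_rate(
    prefixes: List[List[float]],
    in_target: Callable[[float], bool],
    in_unsafe: Callable[[float], bool],
) -> float:
    """Fraction of prefixes on which reach-avoidance is already established."""
    hits = sum(
        classify_prefix(p, in_target, in_unsafe) == "reach-avoid" for p in prefixes
    )
    return hits / len(prefixes)


# Hypothetical 1-D sets: target |x| <= 0.1, unsafe |x| >= 5 (disjoint).
def near_origin(x: float) -> bool:
    return abs(x) <= 0.1


def far_away(x: float) -> bool:
    return abs(x) >= 5.0


prefixes = [[3.0, 2.0, 0.05], [3.0, 6.0, 0.0], [3.0, 2.0]]
rate = empirical_reach_avoid_rate(prefixes, near_origin, far_away)
```

Counting undecided prefixes as failures makes the estimate conservative: a longer simulation horizon can only move prefixes from "undecided" into one of the other two classes.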


3 Supermartingale Certificate Functions 


We now overview three classes of supermartingale certificate functions that 
formally prove reachability, safety and reach-avoidance properties. Supermartin- 
gale certificate functions do not refer to a single class of certificate functions. 
Rather, we use this term to refer to all certificate functions that exhibit a 
supermartingale-like behavior and can formally verify properties in stochastic 
dynamical systems. In what follows, we assume that the control policy $\pi$ is
fixed. In the following section, we will then present a learner-verifier framework 
for jointly learning a control policy and a supermartingale certificate function. 


RSMs for probability 1 reachability. We start with ranking supermartingales
(RSMs), which can prove probability 1 reachability of some target set $\mathcal{X}_t$.
Intuitively, an RSM is a continuous function that maps system states to nonnegative
real values and is required to strictly decrease in expectation by some $\epsilon > 0$ in
every time step until the target $\mathcal{X}_t$ is reached. Due to the strict expected
decrease as well as the Supermartingale Convergence Theorem (Theorem 1), one
can show that the existence of an RSM guarantees that the system under policy
$\pi$ reaches $\mathcal{X}_t$ with probability 1. RSMs can be viewed as a stochastic extension
of Lyapunov functions. Note that RSMs can only be used to prove probability 1
reachability, but cannot be used to reason about probabilistic reachability.
RSMs were originally used for proving almost-sure termination in probabilistic
programs [15] and were used to certify probability 1 reachability in stochastic
dynamical systems in [44].


Definition 1 (Ranking supermartingales [44]). Let $\mathcal{X}_t \subseteq \mathcal{X}$ be a target set.
A continuous function $V : \mathcal{X} \to \mathbb{R}$ is a ranking supermartingale (RSM) with
respect to $\mathcal{X}_t$ if it satisfies:
1. Nonnegativity condition. $V(x) \geq 0$ for each $x \in \mathcal{X}$.
2. Expected Decrease condition. There exists $\epsilon > 0$ such that, for each $x \in
\mathcal{X} \setminus \mathcal{X}_t$, we have $V(x) \geq \mathbb{E}_{\omega \sim d}[V(f(x, \pi(x), \omega))] + \epsilon$.


Theorem 3 ([44]). Suppose that there exists an RSM with respect to $\mathcal{X}_t$. Then,
for every $x_0 \in \mathcal{X}_0$, we have $\mathbb{P}_{x_0}[\mathrm{Reach}(\mathcal{X}_t)] = 1$.
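
To build intuition for the Expected Decrease condition of Definition 1, the sketch below searches a finite set of states for points at which a sample-mean estimate of the expectation violates $V(x) \geq \mathbb{E}_{\omega \sim d}[V(f(x, \pi(x), \omega))] + \epsilon$. Everything here is an assumption made for illustration: the helper names, the one-dimensional toy system, the candidate $V(x) = |x|$, the target set $|x| \leq 0.3$ and $\epsilon = 0.05$. A sample mean is not a sound bound, so this is a sanity check in the spirit of the learner's sampling, not the formal verification step of the framework.

```python
import random
from typing import Callable, List, Sequence


def expected_next_value(
    V: Callable[[float], float],
    f: Callable[[float, float, float], float],
    policy: Callable[[float], float],
    x: float,
    noise_samples: Sequence[float],
) -> float:
    """Sample-mean estimate of E_{w~d}[V(f(x, pi(x), w))]."""
    u = policy(x)
    return sum(V(f(x, u, w)) for w in noise_samples) / len(noise_samples)


def find_violations(
    V: Callable[[float], float],
    f: Callable[[float, float, float], float],
    policy: Callable[[float], float],
    states: Sequence[float],
    in_target: Callable[[float], bool],
    noise_samples: Sequence[float],
    eps: float,
) -> List[float]:
    """States outside the target where V(x) < estimated expectation + eps."""
    violations = []
    for x in states:
        if in_target(x):
            continue  # the condition is only required outside the target set
        if V(x) < expected_next_value(V, f, policy, x, noise_samples) + eps:
            violations.append(x)
    return violations


# Hypothetical toy system: x' = x + u + w with u = -0.5 x and w ~ U[-0.1, 0.1].
def toy_dynamics(x: float, u: float, w: float) -> float:
    return x + u + w


def toy_policy(x: float) -> float:
    return -0.5 * x


def in_target(x: float) -> bool:
    return abs(x) <= 0.3


rng = random.Random(0)
noise = [rng.uniform(-0.1, 0.1) for _ in range(1000)]
grid = [i / 10 for i in range(-20, 21)]  # states in [-2, 2]

violations = find_violations(abs, toy_dynamics, toy_policy, grid, in_target, noise, eps=0.05)
bad_violations = find_violations(
    lambda x: 1.0, toy_dynamics, toy_policy, grid, in_target, noise, eps=0.05
)
```

For this toy system $|0.5x + \omega| \leq 0.5|x| + 0.1$, so $V(x) = |x|$ satisfies the condition with $\epsilon = 0.05$ whenever $|x| \geq 0.3$ and no violations are found, whereas a constant candidate cannot decrease in expectation and is flagged at every state outside the target.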


SBFs for probabilistic safety. On the other hand, stochastic barrier functions
(SBFs) can prove probabilistic safety. Given an unsafe set $\mathcal{X}_u$ and probability
$p \in [0,1)$, an SBF is also a continuous function mapping system states to
nonnegative real values, which is required to decrease in expectation at each time
step. However, unlike RSMs, the expected decrease need not be strict and there
is no target set. In addition, its initial value must be at most 1, whereas its value
upon reaching an unsafe set must be at least $1/(1-p)$. Thus, for the system
under policy $\pi$ to violate the safety constraint, the value of the SBF needs to
increase from at most 1 to at least $1/(1-p)$ even though it is required to decrease
in expectation. The probability of this event can be bounded from above and
shown to be at most $1-p$ by using Theorem 2. We highlight the assumption that
$p < 1$, which is necessary for the safety constraint to be mathematically defined.
As the name suggests, SBFs are a stochastic extension of barrier functions.


Definition 2 (Stochastic barrier functions [50]). Let $\mathcal{X}_u \subseteq \mathcal{X}$ be an unsafe
set and $p \in [0,1)$. A continuous function $V : \mathcal{X} \to \mathbb{R}$ is a stochastic barrier
function (SBF) with respect to $\mathcal{X}_u$ and $p$ if it satisfies:

1. Nonnegativity condition. $V(x) \geq 0$ for each $x \in \mathcal{X}$.

2. Initial condition. $V(x) \leq 1$ for each $x \in \mathcal{X}_0$.

3. Safety condition. $V(x) \geq \frac{1}{1-p}$ for each $x \in \mathcal{X}_u$.

4. Expected Decrease condition. For each $x \in \mathcal{X}$, if $V(x) \leq \frac{1}{1-p}$ then $V(x) \geq
\mathbb{E}_{\omega \sim d}[V(f(x, \pi(x), \omega))]$.


Theorem 4 ([50]). Suppose that there exists an SBF with respect to $\mathcal{X}_u$ and $p$.
Then, for every $\mathbf{x}_0 \in \mathcal{X}_0$, we have $\mathbb{P}_{\mathbf{x}_0}[\mathrm{Safe}(\mathcal{X}_u)] \geq p$.


RASMs for probabilistic reach-avoidance. Finally, reach-avoid supermartingales
(RASMs) unify and extend RSMs and SBFs in the sense that they allow simultaneous
reasoning about reachability and safety and proving a conjunction of
these properties, i.e. reach-avoid properties. Let $\mathcal{X}_t$ and $\mathcal{X}_u$ be disjoint target
and unsafe sets and let $p \in [0,1)$. Similarly to SBFs, an RASM is a continuous
nonnegative function which is required to be initially at most 1 but needs to
attain a value that is at least $1/(1-p)$ for the unsafe region to be reached. On
the other hand, similarly to RSMs, it is required to strictly decrease in expectation
by $\epsilon > 0$ at every time step until either the target set $\mathcal{X}_t$ or a state in
which the value is at least $1/(1-p)$ is reached. Thus, RASMs can be viewed as
a stochastic extension of both Lyapunov functions and barrier functions, which
combines the strict decrease of Lyapunov functions and the level-set reasoning
of barrier functions.


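Definition 3 (Reach-avoid supermartingales [68]). Let $\mathcal{X}_t \subseteq \mathcal{X}$ and $\mathcal{X}_u \subseteq \mathcal{X}$
be a target set and an unsafe set, respectively, and let $p \in [0,1]$ be a probability
threshold. Suppose that either $p < 1$ or that $p = 1$ and $\mathcal{X}_u = \emptyset$. A continuous
function $V : \mathcal{X} \to \mathbb{R}$ is a reach-avoid supermartingale (RASM) with respect to
$\mathcal{X}_t$, $\mathcal{X}_u$ and $p$ if it satisfies:

1. Nonnegativity condition. $V(\mathbf{x}) \geq 0$ for each $\mathbf{x} \in \mathcal{X}$.

2. Initial condition. $V(\mathbf{x}) \leq 1$ for each $\mathbf{x} \in \mathcal{X}_0$.

3. Safety condition. $V(\mathbf{x}) \geq \frac{1}{1-p}$ for each $\mathbf{x} \in \mathcal{X}_u$.

4. Expected Decrease condition. There exists $\epsilon > 0$ such that, for each
$\mathbf{x} \in \mathcal{X} \setminus \mathcal{X}_t$ at which $V(\mathbf{x}) < \frac{1}{1-p}$, we have $V(\mathbf{x}) \geq \mathbb{E}_{\omega \sim d}[V(f(\mathbf{x}, \pi(\mathbf{x}), \omega))] + \epsilon$.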


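Theorem 5 ([68]). Suppose that there exists an RASM with respect to $\mathcal{X}_t$, $\mathcal{X}_u$
and $p$. Then, for every $\mathbf{x}_0 \in \mathcal{X}_0$, we have $\mathbb{P}_{\mathbf{x}_0}[\mathrm{ReachAvoid}(\mathcal{X}_t, \mathcal{X}_u)] \geq p$.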


Note that RASMs indeed unify and generalize the definitions of RSMs and
SBFs. First, by setting $\mathcal{X}_u = \emptyset$ and $p = 1$ (so $1/(1-p) = \infty$), RASMs reduce
to RSMs, as the Initial condition can be enforced without loss of generality
by rescaling. Second, by setting $\mathcal{X}_t = \emptyset$, RASMs reduce to SBFs. In this case,
the Expected Decrease condition is strengthened as it requires strict decrease
by $\epsilon > 0$. However, the proof of Theorem 5 which we outline below also implies
Theorem 4, and $\epsilon > 0$ is only necessary to reason about the reachability of $\mathcal{X}_t$.

We also note that RASMs strictly extend the applicability of RSMs, since
RASMs can be used to prove reachability with any lower bound $p \in [0,1]$ on
probability and not only probability 1 reachability. Indeed, if we set $\mathcal{X}_u = \emptyset$ and
$p \in [0,1]$, in order to prove reachability of $\mathcal{X}_t$ with probability at least $p$ the
RASMs require strict decrease in expectation by $\epsilon > 0$ until either $\mathcal{X}_t$
is reached or the RASM value exceeds $1/(1-p)$ (with $1/(1-p) = \infty$ if $p = 1$).

In the rest of this section, we outline the proof of Theorem 5 that was pre- 
sented in [68]. This proof also implies Theorem 3 and Theorem 4. We do this to 
highlight the connection of RSMs, SBFs and RASMs to the mathematical notion 
of supermartingale processes. We also do this to illustrate the tools from martingale
theory that are used in proving soundness of supermartingale certificate
functions, as we envision that they may be useful in designing supermartingale
certificate functions for more general classes of properties.


Proof (proof sketch of Theorem 5). Here we outline the main ideas behind the
proof, and for the full proof we refer the reader to [68]. Let $\mathbf{x}_0 \in \mathcal{X}_0$. We need to
show that $\mathbb{P}_{\mathbf{x}_0}[\mathrm{ReachAvoid}(\mathcal{X}_t, \mathcal{X}_u)] \geq p$. To do this, we consider the probability
space $(\Omega_{\mathbf{x}_0}, \mathcal{F}_{\mathbf{x}_0}, \mathbb{P}_{\mathbf{x}_0})$ of trajectories that start in $\mathbf{x}_0$ and for each time step $t \in \mathbb{N}_0$
define a random variable in this probability space via

$$X_t(\rho) = \begin{cases} V(\mathbf{x}_t), & \text{if } \mathbf{x}_i \notin \mathcal{X}_t \text{ and } V(\mathbf{x}_i) < \frac{1}{1-p} \text{ for each } 0 \leq i \leq t, \\ 0, & \text{if } \mathbf{x}_i \in \mathcal{X}_t \text{ for some } 0 \leq i \leq t \text{ and } V(\mathbf{x}_j) < \frac{1}{1-p} \text{ for each } 0 \leq j \leq i, \\ \frac{1}{1-p}, & \text{otherwise} \end{cases}$$

for each trajectory $\rho = (\mathbf{x}_t, \mathbf{u}_t, \omega_t)_{t \in \mathbb{N}_0} \in \Omega_{\mathbf{x}_0}$. Hence, $(X_t)_{t=0}^{\infty}$ defines a stochastic
process whose value at each time step is equal to the value of $V$ at the current
system state unless either the target set $\mathcal{X}_t$ has been reached after which future
values of $X_t$ are set to 0, or a state in which $V$ exceeds $1/(1-p)$ has been reached
after which future values of $X_t$ are set to $1/(1-p)$. It can be shown that $(X_t)_{t=0}^{\infty}$
is a nonnegative supermartingale in the probability space $(\Omega_{\mathbf{x}_0}, \mathcal{F}_{\mathbf{x}_0}, \mathbb{P}_{\mathbf{x}_0})$. This claim can be proved by
using the Nonnegativity and the Expected Decrease condition of RASMs. Here
we do not yet need that the expected decrease is strict, i.e. $\epsilon \geq 0$ in the Expected
Decrease condition of RASMs is sufficient.
Since $(X_t)_{t=0}^{\infty}$ is a nonnegative supermartingale, substituting $\lambda = 1/(1-p)$
into the inequality in Theorem 2 shows that

$$\mathbb{P}_{\mathbf{x}_0}\Big[\sup_{i \geq 0} X_i \geq \frac{1}{1-p}\Big] \leq (1-p) \cdot \mathbb{E}_{\mathbf{x}_0}[X_0] \leq 1-p.$$


The second inequality follows since $X_0(\rho) = V(\mathbf{x}_0) \leq 1$ for every $\rho \in \Omega_{\mathbf{x}_0}$ by the
Initial condition of RASMs. Hence, by the Safety condition of RASMs it follows
that the system under policy $\pi$ reaches the unsafe set $\mathcal{X}_u$ with probability at
most $1-p$. Note that here we can already conclude the claim of Theorem 4.
Finally, as $(X_t)_{t=0}^{\infty}$ is a nonnegative supermartingale, by Theorem 1 its value
converges with probability 1. One can then prove that this value has to be either
0 or $\geq 1/(1-p)$ by using the fact that the expected decrease in the Expected
Decrease condition of RASMs is strict. But we showed above that a state in
which $V$ is $\geq 1/(1-p)$ is reached with probability at most $1-p$. Hence, the
probability that the system under policy $\pi$ reaches the target set $\mathcal{X}_t$ without
reaching the unsafe set $\mathcal{X}_u$ is at least $p$, i.e. $\mathbb{P}_{\mathbf{x}_0}[\mathrm{ReachAvoid}(\mathcal{X}_t, \mathcal{X}_u)] \geq p$.


4 Learner-Verifier Framework for Stochastic Systems 


We now present the learner-verifier framework of [44,68] for the learning-based
control and verification of learned controllers in stochastic dynamical systems.
We focus on the probabilistic reach-avoid problem, assume that we are given a
target set $\mathcal{X}_t$, unsafe set $\mathcal{X}_u$ and a probability parameter $p \in [0,1]$, and learn a
control policy $\pi$ and an RASM which certifies that $\mathbb{P}_{\mathbf{x}_0}[\mathrm{ReachAvoid}(\mathcal{X}_t, \mathcal{X}_u)] \geq p$
for all $\mathbf{x}_0 \in \mathcal{X}_0$. The algorithm for learning RSMs and SBFs can be obtained
analogously, since we showed that RASMs unify and generalize RSMs and SBFs.

The algorithm behind the learner-verifier framework consists of two modules:
the learner, which learns a neural network control policy $\pi_\theta$ and a neural
network supermartingale certificate function $V_\nu$, and the verifier, which then
formally verifies the learned candidate function. If the verification step fails, the
verifier produces counterexamples that are passed back to the learner to fine-tune
its loss function. Here, $\theta$ and $\nu$ are vectors of neural network parameters. The
loop is repeated until either a certificate function is successfully verified, or some
specified timeout is reached. By incorporating feedback from the verifier, the
learner is able to tune the policy and the certificate function towards ensuring
that the resulting policy meets the desired reach-avoid specification.
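
The control flow of this loop can be sketched as follows. This is a minimal illustration rather than the implementation of [44,68]: the names `learner_verifier_loop`, `learn` and `verify` are hypothetical, the timeout is modelled as an iteration budget, and the toy learner and verifier merely fit a scalar slope `c` so that `c * |x| >= 2` on three hand-picked "unsafe" points.

```python
from typing import Callable, List, Optional, Tuple, TypeVar

Params = TypeVar("Params")
State = TypeVar("State")


def learner_verifier_loop(
    learn: Callable[[Params, List[State]], Params],
    verify: Callable[[Params], List[State]],
    params: Params,
    max_iterations: int = 10,
) -> Tuple[Optional[Params], int]:
    """Alternate learning and verification until no counterexample remains."""
    counterexamples: List[State] = []
    for iteration in range(1, max_iterations + 1):
        params = learn(params, counterexamples)   # fine-tune on all known counterexamples
        new_counterexamples = verify(params)      # empty list means "verified"
        if not new_counterexamples:
            return params, iteration
        counterexamples.extend(new_counterexamples)
    return None, max_iterations                   # budget exhausted, nothing certified


# Toy instance (assumption): certificate V_c(x) = c * |x| must be >= 2 on UNSAFE.
UNSAFE = [1.0, 1.5, 2.0]


def toy_learn(c: float, cex: List[float]) -> float:
    # Raise the slope just enough to repair every known counterexample.
    return max([c] + [2.0 / abs(x) for x in cex])


def toy_verify(c: float) -> List[float]:
    return [x for x in UNSAFE if c * abs(x) < 2.0]


result = learner_verifier_loop(toy_learn, toy_verify, 0.5)
print(result)  # (2.0, 2)
```

Keeping the counterexample list across iterations mirrors the way the training sets are only ever extended, never reset.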


Applications. As outlined above, the learner-verifier framework can be used for 
learning-based control with formal guarantees that a property of interest is satis- 
fied by jointly learning a control policy and a supermartingale certificate function 
for the property. On the other hand, it can also be used to formally verify a pre- 
viously learned control policy by fixing policy parameters and only learning a 
supermartingale certificate function. Finally, if one uses a different method to 
learn a policy that turns out to violate the desired property, one can use the 
learner-verifier framework to fine-tune an unsafe policy towards repairing it and 
obtaining a safe policy for which a supermartingale certificate function certifies 
that the property of interest is satisfied. 


4.1 Algorithm Initialization 


As mentioned in Section 1, the key challenge for the verifier is to check the Expected
Decrease condition of supermartingale certificates. Our algorithm solves
this challenge by discretizing the state space and verifying a slightly stricter condition
at discretization vertices which we show to imply the Expected Decrease
condition over the whole region required by Definition 3. On the other hand,
learning two neural networks in parallel while simultaneously optimizing several
objectives can be unstable due to inherent dependencies between the two networks.
Thus, proper initialization of networks is important. We allow all neural network
architectures so long as all activation functions are continuous functions.
Furthermore, we apply the softplus activation function to the output neuron of
$V_\nu$, in order to ensure that the value of $V_\nu$ is always nonnegative.


Discretization. A discretization $\tilde{\mathcal{X}}$ of $\mathcal{X}$ with mesh $\tau > 0$ is a set of states such
that, for every $\mathbf{x} \in \mathcal{X}$, there exists a state $\tilde{\mathbf{x}} \in \tilde{\mathcal{X}}$ such that $\|\mathbf{x} - \tilde{\mathbf{x}}\|_1 < \tau$. The
algorithm takes mesh $\tau$ as a parameter and computes a finite discretization $\tilde{\mathcal{X}}$
with mesh $\tau$ by simply taking a hyper-rectangular grid of sufficiently small
cell size. Since $\mathcal{X}$ is compact, this yields a finite discretization.
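
One way to obtain such a grid is sketched below, under the assumption that the state space is an axis-aligned box. The function names are hypothetical. The cell side $h$ is chosen so that $n \cdot h / 2 < \tau$, since the nearest vertex is at most $h/2$ away in each of the $n$ coordinates.

```python
import itertools
import math
from typing import List, Sequence, Tuple


def grid_discretization(
    lows: Sequence[float], highs: Sequence[float], tau: float
) -> List[Tuple[float, ...]]:
    """Vertices of a hyper-rectangular grid with l1 mesh smaller than tau."""
    n = len(lows)
    axes = []
    for lo, hi in zip(lows, highs):
        # Number of cells per axis so that the side h satisfies h < 2 * tau / n.
        k = math.floor(n * (hi - lo) / (2.0 * tau)) + 1
        h = (hi - lo) / k
        axes.append([lo + i * h for i in range(k + 1)])
    return list(itertools.product(*axes))


def l1_distance_to_grid(x: Sequence[float], grid: List[Tuple[float, ...]]) -> float:
    """Distance from x to the closest grid vertex in the l1 norm."""
    return min(sum(abs(a - b) for a, b in zip(x, v)) for v in grid)


grid = grid_discretization([0.0, 0.0], [1.0, 1.0], tau=0.25)
print(len(grid), l1_distance_to_grid((0.1, 0.1), grid))
```

For the unit square and $\tau = 0.25$ this yields five cells per axis, so the worst case (a cell centre) lies at l1 distance 0.2 from the nearest vertex.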


Network initialization. The policy network $\pi_\theta$ is initialized by running proximal
policy optimization (PPO) [54] on the Markov decision process (MDP) defined by
the stochastic dynamical system with a reward function $r_t = \mathbb{1}[\mathcal{X}_t](\mathbf{x}_t) - \mathbb{1}[\mathcal{X}_u](\mathbf{x}_t)$.


The discretization $\tilde{\mathcal{X}}$ is used to define three sets of states which are then used
by the learner to initialize the certificate network $V_\nu$ and to which counterexamples
computed by the verifier will be added later. In particular, the algorithm
initializes $C_{\text{init}} = \tilde{\mathcal{X}} \cap \mathcal{X}_0$, $C_{\text{unsafe}} = \tilde{\mathcal{X}} \cap \mathcal{X}_u$ and $C_{\text{decrease}} = \tilde{\mathcal{X}} \cap (\mathcal{X} \setminus \mathcal{X}_t)$.


4.2 The Learner module


The Learner updates the parameters $\theta$ of the policy and $\nu$ of the neural network
certificate function candidate $V_\nu$ with the objective of the candidate satisfying
the supermartingale certificate conditions. The parameter updates happen incrementally
via gradient descent of the form $\theta \leftarrow \theta - \alpha \frac{\partial \mathcal{L}}{\partial \theta}$ and $\nu \leftarrow \nu - \alpha \frac{\partial \mathcal{L}}{\partial \nu}$,
where $\alpha > 0$ is the learning rate and $\mathcal{L}$ is a loss function that corresponds to a
differentiable optimization objective of the supermartingale certificate conditions.
Ideally, the global minimum of $\mathcal{L}$ should correspond to a policy $\pi_\theta$ and a neural
network $V_\nu$ that fulfill all certificate conditions. In practice, however, due
to the non-convexity of the network $V_\nu$, gradient descent is not guaranteed to
converge to the global minimum. As a result, the learner is not monotone, i.e. a
new iteration does not guarantee improvement over the previous iteration. The
training process usually applies a fixed number of gradient descent iterations or,
alternatively, continues until a certain threshold on the loss value is achieved.


Loss functions. The particular type of loss function £ depends on the type of 
supermartingale certificate function that should be learned by the network, but 
is of the general form 


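$$\mathcal{L}(\theta, \nu) = \mathcal{L}_{\text{Certificate}}(\theta, \nu) + \lambda \cdot (\mathcal{L}_{\text{Lipschitz}}(\theta) + \mathcal{L}_{\text{Lipschitz}}(\nu)), \qquad (1)$$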


where $\mathcal{L}_{\text{Certificate}}$ is the specification-specific loss. The auxiliary loss terms $\mathcal{L}_{\text{Lipschitz}}$
regularize the training to obtain networks $\pi_\theta$ and $V_\nu$ that have a low upper bound
on their Lipschitz constant. The purpose of this regularization is that networks
with a low Lipschitz upper bound are easier to check by the verifier module, i.e. they
require a coarser discretization grid. The value of $\lambda > 0$ decides the strength of
the regularization that is applied. The regularization loss is based on the upper
bound derived in [57] and defined as


$$\mathcal{L}_{\text{Lipschitz}}(\theta) = \max\Big\{ L_V - \frac{\delta}{\tau \cdot (L_f \cdot (L_\pi + 1) + 1)},\ 0 \Big\}. \qquad (2)$$


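In the case of a reach-avoid specification, the RASM certificate loss is

$$\mathcal{L}_{\text{Certificate}}(\theta, \nu) = \mathcal{L}_{\text{Expected}}(\theta, \nu) + \mathcal{L}_{\text{Unsafe}}(\nu) + \mathcal{L}_{\text{Init}}(\nu), \qquad (3)$$

with

$$\mathcal{L}_{\text{Expected}}(\theta, \nu) = \frac{1}{|C_{\text{decrease}}|} \sum_{\mathbf{x} \in C_{\text{decrease}}} \Big( \max\Big\{ \sum_{\omega_1, \dots, \omega_N \sim \mathcal{N}} \frac{V_\nu(f(\mathbf{x}, \pi_\theta(\mathbf{x}), \omega_i))}{N} - V_\nu(\mathbf{x}) + \tau \cdot K,\ 0 \Big\} \Big),$$

$$\mathcal{L}_{\text{Init}}(\nu) = \max_{\mathbf{x} \in C_{\text{init}}} \{ V_\nu(\mathbf{x}) - 1,\ 0 \},$$

$$\mathcal{L}_{\text{Unsafe}}(\nu) = \max_{\mathbf{x} \in C_{\text{unsafe}}} \Big\{ \frac{1}{1-p} - V_\nu(\mathbf{x}),\ 0 \Big\}.$$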


The sets $C_{\text{decrease}}$, $C_{\text{init}}$ and $C_{\text{unsafe}}$ are the training sets for achieving the expected
decrease, initial and unsafe RASM conditions. Each of the three sets is
initialized with a coarse discretization of the state space to guide the learning
toward learning a correct RASM already in the first loop iteration. In the subsequent
calls to the learner, these sets are extended by counterexamples computed
by the verifier. In [68] it was shown that, if $V_\nu$ is an RASM and satisfies all conditions
checked by the verifier below, then $\mathcal{L}_{\text{Certificate}}(\theta, \nu) \to 0$ as the number
of samples $N$ used to estimate expected values in $\mathcal{L}_{\text{Expected}}(\theta, \nu)$ increases.
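
To make the structure of the certificate loss concrete, the following sketch evaluates its three hinge terms in plain Python for a one-dimensional toy system. It is an illustration under stated assumptions, not the training code of [68]: the dynamics `f`, the policy `pi`, the candidate certificates, the value of `K` and the fixed list of noise samples are all hypothetical, and no gradients are computed.

```python
def rasm_certificate_loss(V, f, pi, c_decrease, c_init, c_unsafe, p, tau, K, noise):
    """Sum of the Expected Decrease, Initial and Unsafe hinge terms."""
    bound = 1.0 / (1.0 - p)
    # Expected Decrease term: mean hinge loss over the decrease training set,
    # with the expectation replaced by a sample mean over `noise`.
    expected = 0.0
    for x in c_decrease:
        mean_next = sum(V(f(x, pi(x), w)) for w in noise) / len(noise)
        expected += max(mean_next - V(x) + tau * K, 0.0)
    expected /= len(c_decrease)
    # Initial term: worst violation of V(x) <= 1 on the initial training set.
    init = max(max(V(x) - 1.0, 0.0) for x in c_init)
    # Unsafe term: worst violation of V(x) >= 1/(1-p) on the unsafe training set.
    unsafe = max(max(bound - V(x), 0.0) for x in c_unsafe)
    return expected + init + unsafe


def f(x, u, w):
    return x + u + w          # toy dynamics (assumption)


def pi(x):
    return -0.5 * x           # toy contracting policy (assumption)


sets = dict(c_decrease=[0.2, 0.5, 0.8], c_init=[-0.25, 0.0, 0.25], c_unsafe=[-1.0, 1.0, 1.2])
noise = [-0.01, 0.0, 0.01]

good = rasm_certificate_loss(lambda x: 4.0 * abs(x), f, pi, p=0.75, tau=0.05, K=4.0, noise=noise, **sets)
bad = rasm_certificate_loss(lambda x: abs(x), f, pi, p=0.75, tau=0.05, K=4.0, noise=noise, **sets)
print(good, bad)
```

The candidate $4|x|$ incurs zero loss on these training points, whereas $|x|$ is penalised mainly by the Unsafe term because it stays below $1/(1-p) = 4$ on the unsafe points.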


4.3 The Verifier module 


Verification task. The verifier now formally checks whether the learned RASM
candidate $V_\nu$ satisfies the four RASM defining conditions in Definition 3. Since we
applied the softplus activation function to the output neuron of $V_\nu$, we know that
the Nonnegativity condition is satisfied by default. Thus, the verifier only needs
to check the Initial, Safety and Expected Decrease conditions in Definition 3.


Expected Decrease condition. To check the Expected Decrease condition, we utilize
the fact that the dynamics function $f$ is Lipschitz continuous and that the
state space $\mathcal{X}$ is compact to show that it suffices to check a slightly stricter condition
at the discretization points. Let $L_f$ be a Lipschitz constant of $f$. Since $\pi_\theta$
and $V_\nu$ are continuous functions defined over the compact domain $\mathcal{X}$, we know
that they are also Lipschitz continuous. Let $L_\pi$ and $L_V$ be their Lipschitz constants.
We assume that $L_f$ is provided to the algorithm, and use the method
of [57] for computing neural network Lipschitz constants to compute $L_\pi$ and $L_V$.
To verify the Expected Decrease condition, the verifier collects a subset
$\tilde{\mathcal{X}}_e \subseteq \tilde{\mathcal{X}}$ of all discretization vertices whose adjacent grid cells contain a non-target
state and over which $V_\nu$ attains a value that is smaller than $\frac{1}{1-p}$. To
compute this set, the algorithm first collects all grid cells that intersect $\mathcal{X} \setminus \mathcal{X}_t$.
For each collected cell, it then uses interval arithmetic abstract interpretation
(IA-AI) [24,30] to propagate interval bounds across neural network layers towards
bounding from below the minimal value that $V_\nu$ attains over the cell.
Finally, it adds to $\tilde{\mathcal{X}}_e$ vertices of those cells at which the computed lower bound
is less than $1/(1-p)$.
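
The interval propagation step can be illustrated on a tiny network. The sketch below is a simplified stand-in for IA-AI [24,30], assuming a fully connected network with ReLU hidden activations and a softplus output neuron. The weights are made up for illustration, and floating-point rounding, which a sound verifier has to account for, is ignored.

```python
import math
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight matrix, bias vector)


def affine_bounds(W, b, lo, hi):
    """Propagate the box [lo, hi] through x -> W x + b."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l, u = bias, bias
        for w, a, c in zip(row, lo, hi):
            # A nonnegative weight maps [a, c] to [w*a, w*c]; a negative one flips it.
            l += w * a if w >= 0 else w * c
            u += w * c if w >= 0 else w * a
        out_lo.append(l)
        out_hi.append(u)
    return out_lo, out_hi


def softplus(z: float) -> float:
    return math.log1p(math.exp(z))


def certificate_bounds(layers: Sequence[Layer], lo, hi) -> Tuple[float, float]:
    """Lower and upper bound on the network output over the box [lo, hi]."""
    for index, (W, b) in enumerate(layers):
        lo, hi = affine_bounds(W, b, lo, hi)
        if index < len(layers) - 1:
            # ReLU is monotone, so it can be applied to both bounds directly.
            lo = [max(0.0, v) for v in lo]
            hi = [max(0.0, v) for v in hi]
    # Softplus is monotone as well and keeps the output nonnegative.
    return softplus(lo[0]), softplus(hi[0])


def certificate_value(layers: Sequence[Layer], x) -> float:
    # A degenerate box [x, x] yields the exact network output.
    return certificate_bounds(layers, x, x)[0]


layers = [([[1.0, -1.0], [0.5, 2.0]], [0.0, -1.0]), ([[1.0, -1.0]], [0.5])]
lower, upper = certificate_bounds(layers, [0.0, 0.0], [0.5, 0.5])
print(lower, upper)
```

With $p = 0.75$, the vertices of this cell would be collected whenever `lower` is below $1/(1-p) = 4$.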


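Finally, the verifier checks if the following condition is satisfied at each $\tilde{\mathbf{x}} \in \tilde{\mathcal{X}}_e$:

$$\mathbb{E}_{\omega \sim d}\big[ V_\nu\big( f(\tilde{\mathbf{x}}, \pi_\theta(\tilde{\mathbf{x}}), \omega) \big) \big] < V_\nu(\tilde{\mathbf{x}}) - \tau \cdot K, \qquad (4)$$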


where $K = L_V \cdot (L_f \cdot (L_\pi + 1) + 1)$. Note that this condition is a strengthened
version of the Expected Decrease condition, where instead of strict decrease by
arbitrary $\epsilon > 0$ we require strict decrease by at least $\tau \cdot K$ which depends on
the discretization mesh $\tau$ and Lipschitz constants of $f$, $\pi_\theta$ and $V_\nu$. To compute
$\mathbb{E}_{\omega \sim d}[V_\nu(f(\tilde{\mathbf{x}}, \pi_\theta(\tilde{\mathbf{x}}), \omega))]$ in eq. (4), we cannot simply evaluate the expected value
in state $\tilde{\mathbf{x}}$ by substituting $\tilde{\mathbf{x}}$ into some expression, as we do not know a closed-form
expression for the expected value of a neural network function. Instead,
the algorithm uses the method of [44] to compute upper and lower bounds on
the expected value of a neural network function, which we describe in Section 5.
The upper bound is then plugged into eq. (4).
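
As a worked illustration of the quantities in eq. (4), the sketch below bounds a network's Lipschitz constant by the product of its layers' operator norms and then evaluates the strengthened decrease check at a single vertex. The product bound is a standard one for networks with 1-Lipschitz activations. Using the norm induced by the l1 norm (maximum absolute column sum) matches the l1 mesh of the discretization, but whether this coincides in every detail with the method of [57] is an assumption, as are all function names and numbers.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def induced_l1_norm(W: Matrix) -> float:
    """Operator norm induced by the l1 norm: maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in W) for j in range(len(W[0])))


def lipschitz_upper_bound(weights: Sequence[Matrix]) -> float:
    """Product of layer norms; valid when all activations are 1-Lipschitz."""
    bound = 1.0
    for W in weights:
        bound *= induced_l1_norm(W)
    return bound


def decrease_margin(L_V: float, L_f: float, L_pi: float, tau: float) -> float:
    """The slack tau * K with K = L_V * (L_f * (L_pi + 1) + 1)."""
    K = L_V * (L_f * (L_pi + 1.0) + 1.0)
    return tau * K


def satisfies_eq4(expected_upper: float, v_at_vertex: float,
                  L_V: float, L_f: float, L_pi: float, tau: float) -> bool:
    """Check an upper bound on the expected next value against V(x) - tau * K."""
    return expected_upper < v_at_vertex - decrease_margin(L_V, L_f, L_pi, tau)


L_net = lipschitz_upper_bound([[[1.0, -2.0], [3.0, 4.0]], [[0.5, -1.0]]])
margin = decrease_margin(L_V=2.0, L_f=1.5, L_pi=1.0, tau=0.01)
print(L_net, margin)
```

With these numbers $K = 2 \cdot (1.5 \cdot 2 + 1) = 8$, so a mesh of $\tau = 0.01$ demands a decrease of at least 0.08 at every checked vertex; halving $\tau$ or the Lipschitz bounds halves this requirement.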

If no violations to eq. (4) are found, the verifier concludes that the Expected
Decrease condition is satisfied. Otherwise, for any counterexample $\tilde{\mathbf{x}}$ to eq. (4),
the algorithm checks if $\tilde{\mathbf{x}} \in \mathcal{X} \setminus \mathcal{X}_t$ and $V_\nu(\tilde{\mathbf{x}}) < 1/(1-p)$ and if so adds it to the
counterexample set $C_{\text{decrease}}$.


Initial and safety conditions. The Initial and Safety conditions are checked using
IA-AI. To check the Initial condition, the verifier collects the set $\text{Cells}_{\mathcal{X}_0}$ of all
grid cells that intersect the initial set $\mathcal{X}_0$, and for each cell in $\text{Cells}_{\mathcal{X}_0}$ checks if

$$\sup_{\mathbf{x} \in \text{cell}} V_\nu(\mathbf{x}) > 1. \qquad (5)$$

The supremum is bounded from above via IA-AI by propagating interval bounds
across neural network layers. If no violations are found, the verifier concludes
that $V_\nu$ satisfies the Initial condition. Otherwise, vertices of any grid cells which
are counterexamples to eq. (5) and which are contained in $\mathcal{X}_0$ are added to $C_{\text{init}}$.
Analogously, to check the Safety condition, the verifier collects the set $\text{Cells}_{\mathcal{X}_u}$
of all grid cells that intersect the unsafe set $\mathcal{X}_u$, and for each cell checks if

$$\inf_{\mathbf{x} \in \text{cell}} V_\nu(\mathbf{x}) < \frac{1}{1-p}. \qquad (6)$$

If no violations are found, the verifier concludes that $V_\nu$ satisfies the Safety
condition. Otherwise, vertices of any grid cells which are counterexamples to
eq. (6) and which are contained in $\mathcal{X}_u$ are added to $C_{\text{unsafe}}$.


Algorithm output and correctness. If all three checks are successful and no counterexample
is found, the algorithm concludes that $\pi_\theta$ guarantees reach-avoidance
with probability at least $p$ and outputs the policy $\pi_\theta$. Otherwise, it proceeds to
the next learner-verifier iteration where computed counterexamples are added to
sets $C_{\text{init}}$, $C_{\text{unsafe}}$ and $C_{\text{decrease}}$ to be used by the learner. The following theorem
establishes correctness of the verifier module, and its proof can be found in [68].


Theorem 6 ([68]). Suppose that the verifier verifies that the certificate $V_\nu$ satisfies
eq. (4) for each $\tilde{\mathbf{x}} \in \tilde{\mathcal{X}}_e$, eq. (5) for each cell $\in \text{Cells}_{\mathcal{X}_0}$ and eq. (6) for each
cell $\in \text{Cells}_{\mathcal{X}_u}$. Then the function $V_\nu$ is an RASM for the system with respect to
$\mathcal{X}_t$, $\mathcal{X}_u$ and $p$.


Optimizations. The verification task can be made more efficient by a discretization
refinement procedure. In particular, the verifier may start with a coarse grid
and decompose each grid cell on demand into a finer discretization when the
check of some RASM condition fails. This procedure can be applied recursively
to refine further when elements of the decomposed grid cannot be
verified. If the recursion encounters a grid element that violates eq. (4) even
for $\tau = 0$, the refinement procedure terminates unsuccessfully with the grid center
point as a counterexample to the RASM condition. This optimization with
a maximum recursion depth of 1 has been applied in [68].


5 Bounding Expected Values of Neural Networks 


We now present the method for computing upper and lower bounds on the
expected value of a neural network function over a given probability distribution.
We are not aware of any existing methods for solving this problem, so we believe
that this is a result of independent interest.

To define the setting of the problem at hand, let $\mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^n$ be a system
state and suppose that we want to compute upper and lower bounds on the
expected value $\mathbb{E}_{\omega \sim d}[V(f(\mathbf{x}, \pi(\mathbf{x}), \omega))]$. Here $d$ is a probability distribution over
the stochastic disturbance space $\mathcal{N} \subseteq \mathbb{R}^p$ from which the stochastic disturbance
is sampled independently at each time step. As noted in Section 2, we assume
that $d$ is a product of independent univariate probability distributions. Alternatively,
the method is also applicable if the support of $d$ is bounded.

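The method first partitions the stochastic disturbance space $\mathcal{N} \subseteq \mathbb{R}^p$ into
finitely many cells $\text{cell}(\mathcal{N}) = \{\mathcal{N}_1, \dots, \mathcal{N}_k\}$. Let $\text{maxvol} = \max_{\mathcal{N}_i \in \text{cell}(\mathcal{N})} \text{vol}(\mathcal{N}_i)$
and $\text{minvol} = \min_{\mathcal{N}_i \in \text{cell}(\mathcal{N})} \text{vol}(\mathcal{N}_i)$ denote the maximal and the minimal volume
of any cell in the partition with respect to the Lebesgue measure over $\mathbb{R}^p$,
respectively. Also, for each $\omega \in \mathcal{N}$ let $F(\omega) = V(f(\mathbf{x}, \pi(\mathbf{x}), \omega))$. The upper and
the lower bounds on the expected value are computed as follows:

$$\mathbb{E}_{\omega \sim d}\big[ V(f(\mathbf{x}, \pi(\mathbf{x}), \omega)) \big] \leq \sum_{\mathcal{N}_i \in \text{cell}(\mathcal{N})} \text{maxvol} \cdot \sup_{\omega \in \mathcal{N}_i} F(\omega),$$

$$\mathbb{E}_{\omega \sim d}\big[ V(f(\mathbf{x}, \pi(\mathbf{x}), \omega)) \big] \geq \sum_{\mathcal{N}_i \in \text{cell}(\mathcal{N})} \text{minvol} \cdot \inf_{\omega \in \mathcal{N}_i} F(\omega).$$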


Each supremum (resp. infimum) in the sum is then bounded from above (resp. from 
below) via interval arithmetic abstract interpretation by using the method of [30]. 

If the support of d is bounded, then no further adjustments are needed. 
However, if the support of d is unbounded, maxvol and minvol may not be finite. 
In this case, since we assume that d is a product of univariate distributions, the 
method first applies the probability integral transform [48] to each univariate 
probability distribution in d in order to reduce the problem to the case of a 
probability distribution of bounded support. 
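
The partition-based bounds can be illustrated in the simplest setting: a single disturbance that is uniform on $[0, 1]$ (the situation after the probability integral transform) and a partition into equal-width cells, so that maxvol and minvol coincide with the probability mass of every cell. The sketch below assumes exactly this setting. The callback `cell_bounds` stands in for the interval-arithmetic bounds on $F$, and the test function $F(\omega) = \omega^2$ is made up for illustration.

```python
from typing import Callable, Tuple


def expected_value_bounds(
    cell_bounds: Callable[[float, float], Tuple[float, float]],
    num_cells: int,
) -> Tuple[float, float]:
    """Lower and upper bound on E[F(w)] for w uniform on [0, 1].

    cell_bounds(lo, hi) must return (lower, upper) bounds on F over [lo, hi].
    """
    vol = 1.0 / num_cells  # equal cells: maxvol = minvol = vol
    lower = upper = 0.0
    for i in range(num_cells):
        lo, hi = i * vol, (i + 1) * vol
        inf_f, sup_f = cell_bounds(lo, hi)
        lower += vol * inf_f
        upper += vol * sup_f
    return lower, upper


def square_bounds(lo: float, hi: float) -> Tuple[float, float]:
    # F(w) = w**2 is increasing on [0, 1], so the endpoints give exact bounds.
    return lo * lo, hi * hi


lower, upper = expected_value_bounds(square_bounds, num_cells=100)
print(lower, upper)  # both close to the true value 1/3
```

Refining the partition tightens the enclosure: for this monotone $F$ the gap between the two bounds is $(F(1) - F(0)) / k$ for $k$ cells.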


6 Discussion on Extension to General Certificates 


The focus of this survey has primarily been on three concrete classes of super- 
martingale certificate functions in stochastic systems, namely RSMs, SBFs and 
RASMs, and the learner-verifier framework for their computation. For each class 
of supermartingale certificate functions, the learner module encodes the defining
conditions of the certificate as a differentiable loss function whose minimiza- 
tion leads to a candidate certificate function. The verifier module then formally 
checks whether the defining conditions of the certificate function are satisfied. 
These checks are performed by discretizing the state space and using interval 
arithmetic abstract interpretation and the previously discussed method for com- 
puting bounds on expected values of neural network functions. 

It should be noted that the design of both the learner and the verifier modules 
was not specifically tailored to any of the three certificate functions. Rather, both 
the learner and the verifier follow very general design principles that we envision
are applicable to more general classes of certificate functions. In particular, we
hypothesize that as long as the state space of the system is compact and a 
certificate function can be defined in terms of 


— exact and expected value evaluations of Lipschitz continuous functions, and 
— inequalities between such evaluations imposed over state space regions, 


then the learner-verifier framework in Section 4 may present a promising ap- 
proach to learning and verifying the certificate function. In particular, the learner- 
verifier framework presents a natural candidate for automating the computa- 
tion of any supermartingale certificate function that may be designed for other 
properties in the future. Furthermore, while RSMs, SBFs and RASMs exhibit 
a supermartingale-like behavior which is fundamental for their soundness, the 
learner-verifier framework does not rely or depend on their supermartingale-like 
behavior. Hence, we envision that the learner-verifier framework could also be 
used to compute other classes of stochastic certificate functions. 

Even more generally, note that all certificate functions that we have considered
so far are of the type $\mathcal{X} \to \mathbb{R}$. One could also consider extensions of the
learner-verifier framework to learning certificate functions of different datatypes.
For instance, the work [43] uses a learner-verifier framework to learn an inductive
transition invariant of type $\mathcal{X} \times \mathcal{X} \to \mathbb{R}$ that certifies safety in deterministic
systems. On the other hand, lexicographic ranking supermartingales are a multi-dimensional
generalization of RSMs of type $\mathcal{X} \to \mathbb{R}^k$ that provide a more efficient
and compositional approach to proving probability 1 termination in probabilistic 
programs [5,22]. Studying possible extensions of the learner-verifier framework 
for stochastic systems to learn certificate functions of different arity of both 
domain and codomain is a very interesting direction of future work. 


7 Related Work 


Existing learning-based methods for learning and verification of certificate func- 
tions in deterministic and stochastic systems have been discussed in Section 1. In 
this section, we overview some other existing methods for verification and con- 
trol of stochastic dynamical systems, as well as some other uses of martingale 
theory in stochastic system verification. 


Abstraction-based methods. Another class of approaches to stochastic dynamical
system control with formal safety guarantees are abstraction-based methods
[56,42,14,63,60,25]. These methods consider finite-time horizon systems and
approximate them via a finite-state Markov decision process (MDP). The control
problem is then solved for the obtained MDP and the computed policy is used
to exhibit a policy for the original stochastic dynamical system. The key difference
in applicability between abstraction-based methods and our framework is
that abstraction-based methods consider finite-time horizon systems, whereas
we consider infinite-time horizon systems.


Safe control via shielding. Shielding is an RL framework that ensures safety in
the context of avoidance of unsafe regions by computing two control policies:
the main policy that optimizes the expected reward, and the backup policy that
the system falls back to whenever the safety constraint may be violated [7,36,29].


Constrained MDPs. A standard approach to safe RL is to solve constrained
MDPs (CMDPs) [8,28] which impose hard constraints on expected cost for one or
more auxiliary cost functions. Several efficient RL algorithms for solving CMDPs
have been proposed [59,4]; however, their constraints are only satisfied in expectation,
hence constraint satisfaction is not formally guaranteed.


RL reward specification and neurosymbolic methods. There are several works 
on solving model-free RL tasks under logic specifications. In particular, several 
works propose methods for designing reward functions that encode temporal 
logic specifications [6,12,32,31,45,34,13,40,39]. Formal methods have also been 
used for extraction of interpretable policies [62,61,35] and safe RL [10,67,11]. 


Deterministic systems with stochastic controllers. Another way to give rise to a 
stochastic dynamical system is to consider a dynamical system with deterministic 
dynamics function and use a stochastic controller, which helps in quantifying 
uncertainty in the controller’s prediction. Formal verification of deterministic 
dynamical systems with Bayesian neural network controllers has been considered 
in [43]. In particular, this work also uses a learner-verifier method to learn an 
inductive invariant for the deterministic system which formally proves safety. 


Supermartingales for probabilistic program analysis. Supermartingales have also 
been used for the analysis of probabilistic programs (PPs). In particular, RSMs 
were originally introduced in the setting of PPs to prove almost-sure termi- 
nation [15] and have since been extensively used, see e.g. [19,20,5,47,22]. The 
work [1] proposed a learner-verifier method to learn an RSM in the PP. Super- 
martingales were also used for safety [23,64,21], cost [65] and recurrence and 
persistence [16] analysis in PPs. 


8 Conclusion 


This paper presents a framework for learning-based control with formal reach- 
ability, safety and reach-avoidance guarantees in stochastic dynamical systems. 
We present a learner-verifier framework in which a neural network control pol- 
icy is learned together with a neural network certificate function that formally 
proves that the property of interest holds with at least some desired proba- 
bility p € [0,1]. For certification, we use supermartingale certificate functions. 
The learner module encodes the defining certificate function conditions into a 
differentiable loss function which is then minimized to learn a candidate certifi- 
cate function. The verifier then formally verifies the candidate by using interval 
arithmetic abstract interpretation and a novel method for computing bounds on 
expected values of neural networks. 

The learner-verifier framework presented in this work opens several interest- 
ing directions for future work. The first is the design of supermartingale cer- 
tificates for more general properties of stochastic systems and the use of our 
learner-verifier framework for their computation. The second is to study and un- 
derstand the general class of certificate functions in stochastic systems that the 
learner-verifier can be used to compute, possibly going beyond supermartingale 
certificate functions. Finally, on the practical side, an avenue for future work is
to explore methods for reducing the computational cost of the framework and
extensions that can handle more complex and higher dimensional systems.


Abstract. Many types of attacks on confidentiality stem from the nondeterministic
nature of the environment that computer programs operate
in. We focus on verification of confidentiality in nondeterministic envi- 
ronments by reasoning about asynchronous hyperproperties. We general- 
ize the temporal logic A-HLTL to allow nested trajectory quantification, 
where a trajectory determines how different execution traces may ad- 
vance and stutter. We propose a bounded model checking algorithm for 
A-HLTL based on QBF-solving for a fragment of A-HLTL and evaluate 
it by various case studies on concurrent programs, scheduling attacks, 
compiler optimization, speculative execution, and cache timing attacks. 
We also rigorously analyze the complexity of model checking A-HLTL. 


1 Introduction 


Motivation. Consider the concurrent program [10] shown in Fig. 1, where h is a
secret variable, and the await command is a conditional critical region. This program
should satisfy the following information-flow policy: “Any sequence of observable
outputs produced by an interleaving should be reproducible by some other
interleaving for a different value of h”. If this is the case, then an attacker
cannot successfully guess the value of h from the sequence of observable outputs
of the print() statements. For example, Fig. 2 shows how one can align two
interleavings of threads T1 and T2 with respect to the observable sequence of
outputs ‘abcd’, given two different values of secret h. Let us call such an
alignment a trajectory (illustrated by the sequence of dashed lines). However, if
thread T1 holds the semaphore and executes the critical region as an atomic
operation, then the output ‘acdb’ arising due to concurrent execution of threads
T1 and T2 reveals the value of h as 0, as the same output cannot be reproduced
when h=1. Thus, the program in Fig. 1 violates the above policy.

[Fig. 1: listing of the concurrent program with threads T1 and T2.]
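
The leak can be reproduced by brute-force enumeration of interleavings. Since the listing of Fig. 1 is not reproduced here, the sketch below works on an assumed two-thread program modelled on the description above: T1 prints 'a' and 'b' inside a critical region guarded by a semaphore, and T2 prints 'c', enters the same critical region only if h is set, and then prints 'd'. The step encoding and all names are hypothetical.

```python
from typing import Set, Tuple


def observable_outputs(h: int) -> Set[str]:
    """Output strings of all interleavings of the assumed program for secret h."""
    # "acq"/"rel" model entering and leaving an `await sem>0` critical region.
    t1 = ["acq", "a", "v+=1", "b", "rel"]
    t2 = ["c"] + (["acq", "v+=2", "rel"] if h else ["skip"]) + ["d"]
    threads = (t1, t2)
    results: Set[str] = set()

    def explore(pcs: Tuple[int, int], sem: int, out: str) -> None:
        if pcs[0] == len(t1) and pcs[1] == len(t2):
            results.add(out)
            return
        for i, thread in enumerate(threads):
            if pcs[i] == len(thread):
                continue  # this thread has terminated
            step = thread[pcs[i]]
            if step == "acq" and sem == 0:
                continue  # blocked: the other thread holds the semaphore
            new_sem = sem - 1 if step == "acq" else sem + 1 if step == "rel" else sem
            new_out = out + step if step in ("a", "b", "c", "d") else out
            new_pcs = (pcs[0] + 1, pcs[1]) if i == 0 else (pcs[0], pcs[1] + 1)
            explore(new_pcs, new_sem, new_out)

    explore((0, 0), 1, "")
    return results


low, high = observable_outputs(0), observable_outputs(1)
print(sorted(low - high))  # ['acdb']
```

Under these assumptions 'abcd' is producible for both values of h, whereas 'acdb' occurs only for h = 0: printing 'd' between 'a' and 'b' would require T2 to pass through the critical region while T1 holds the semaphore.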


The above policy is an example of a hyperproperty [5]; i.e., a set of sets of
execution traces. In addition to information-flow requirements, hyperproperties
can express other complex requirements such as linearizability [12] and control
conditions in cyber-physical systems such as robustness and sensitivity. The
temporal logic A-HLTL [1] can express hyperproperties whose sets of traces
advance at different speeds, allowing stuttering steps. For example, the above
policy can be expressed in A-HLTL by the following formula:
$\varphi_{\mathsf{NI}} = \forall \pi. \exists \pi'. \mathsf{E}\tau.\ (h_{\pi,\tau} \neq h_{\pi',\tau}) \wedge \Box(\mathit{obs}_{\pi,\tau} = \mathit{obs}_{\pi',\tau})$, where
obs denotes the output observations, meaning that for all executions (i.e.,
interleavings) $\pi$, there should exist another execution $\pi'$ and a trajectory $\tau$,
such that $\pi$ and $\pi'$ start from different values of h and $\tau$ can align all the
observations along $\pi$ and $\pi'$ (see Fig. 2). A-HLTL can reason about one source of
nondeterminism by the scheduler in the system that may lead to information
leak. Indeed, the model checking algorithms proposed in [1] can discover the bug
in the program in Fig. 1.
Now, consider a more complex version of the same program, shown in Fig. 3, inspired by modern
programming languages such as Go and P that allow CSP-style concurrency. Here, new threads T3
and T4 read the values of secret input h and public input l from two asynchronous channels,
rendering two different sources of nondeterminism: (1) the scheduler that results in different
interleavings, and (2) data availability in the channels.

    Thread T1(){
      while (true){
        await sem>0 then
          sem = sem - 1;
        print('a');
        v = v+1;
        print('b');
        sem = sem + 1;
      }
    }

    Thread T2(){
      while (true)
        h = read(Channel1);
    }

    Thread T3(){
      while (true){
        print('c');
        if (h == 1) then
          await sem>0 then
            sem = sem - 1;
          v = v+2;
          sem = sem + 1;
        else
          skip;
        print('d');
      }
    }

    Thread T4(){
      while (true)
        l = read(Channel2);
    }

Fig. 3: T1 and T2 receive inputs from asynch. channels read by T3 and T4.

This, in turn, means formula $\varphi_{NI}$ no longer captures the following specification of
the program, which should be:


“Any sequence of observable outputs produced by an interleaving should 
be reproducible by some other interleaving such that for all alignments of 
public inputs, there exists an alignment of the public outputs”. 


Satisfaction of this policy (not expressible in A-HLTL as proposed in [1]) prohibits 
an attacker from successfully determining the sequence of values of h. 


Fig. 2: Two secure interleavings for the program in Fig. 1
Contributions. In this paper, we strive for a general logic-based approach that
enables model checking of a rich set of asynchronous hyperproperties. To this
end, we concentrate on A-HLTL model checking for programs subject to multiple
sources of nondeterminism. Our first contribution is a generalization of A-HLTL
that allows nested trajectory quantification. For example, the above policy
requires reasoning about two different trajectories that cannot be composed into
one since their sources of nondeterminism are different. This observation
motivates the need for enriching A-HLTL with the tools to quantify over trajectories.
This generalization enables expressing policies such as the following:

$\varphi_{NI_{nd}} = \forall\pi.\exists\pi'.\mathsf{A}\tau.\mathsf{E}\tau'.\ \big(\Diamond(h_{\pi,\tau} \neq h_{\pi',\tau}) \wedge \Box(l_{\pi,\tau} = l_{\pi',\tau})\big) \rightarrow \Box(obs_{\pi,\tau'} = obs_{\pi',\tau'})$,

where $\mathsf{A}$ and $\mathsf{E}$ denote the universal (resp., existential) trajectory quantifiers.

Our second contribution is a bounded model checking (BMC) algorithm for 
a fragment of the extended A-HLTL that allows an arbitrary number of trace 
quantifier alternations and up to one trajectory quantifier alternation. Follow- 
ing [15], we propose two bounded semantics (called optimistic and pessimistic) 
for A-HLTL based on the satisfaction of eventualities. We introduce a reduction to 
the satisfiability problem for quantified Boolean formulas (QBF) and prove that 
our translation provides decision procedures for A-HLTL BMC for terminating 
systems, i.e., those whose Kripke structure is acyclic. Our focus on terminating 
programs is due to the general undecidability of A-HLTL model checking [1]. As 
in the classic BMC for LTL, the power of our technique is in hunting bugs that 
are often in the shallow parts of reachable states. 

Our third contribution is a rigorous complexity analysis of A-HLTL model checking for
terminating programs (see Table 1). We show that for formulas with only one trajectory
quantifier the complexity is aligned with that of classic synchronous semantics of
HyperLTL [4]. However, the complexity of A-HLTL model checking with multiple trajectory
quantifiers is one step higher than HyperLTL model checking in the polynomial hierarchy. An
interesting observation here is that the complexity of model checking a formula with two
existential trajectory quantifiers is one step higher than one with only one existential
quantifier although the plurality of the quantifiers does not change. Generally speaking,
A-HLTL model checking for terminating programs remains in PSPACE.

Table 1: A-HLTL model checking complexity for acyclic models.

  Fragment                                                      Complexity
  $\exists^+\mathsf{E}$ / $\forall^+\mathsf{A}$                 NL-complete (Theorem 2)
  $[\exists(\exists/\forall)^+\mathsf{A}/\mathsf{E}]^k$         $\Sigma^p_k$-complete (Theorem 3)
  $[\forall(\forall/\exists)^+\mathsf{A}/\mathsf{E}]^k$         $\Pi^p_k$-complete (Theorem 3)
  $[\exists(\exists/\forall)^+\mathsf{E}\mathsf{E}^+]^k$        $\Sigma^p_{k+1}$-complete (Theorem 4)
  $[\forall(\forall/\exists)^+\mathsf{A}\mathsf{A}^+]^k$        $\Pi^p_{k+1}$-complete (Theorem 4)
  $[\exists(\exists/\forall)^+\mathsf{A}^+\mathsf{E}^+]^k$      $\Sigma^p_{k+1}$-complete (Theorem 5)
  $[\forall(\forall/\exists)^+\mathsf{E}^+\mathsf{A}^+]^k$      $\Pi^p_{k+1}$-complete (Theorem 5)
  A-HLTL                                                        PSPACE

Finally, we have implemented our BMC technique. We evaluate our imple- 
mentation on verification of four case studies: (1) information-flow security in 
concurrent programs, (2) information leak in speculative executions, (3) preser- 
vation of security in compiler optimization, and (4) cache-based timing attacks. 
These case studies exhibit a proof of concept for the highly intricate nature 
of information-flow requirements and how our foundational theoretical results 
handle them. 
Related Work. The concept of hyperproperties is due to Clarkson and Schneider [5].
HyperLTL [4] and A-HLTL are currently the only logics for which practical
model checking algorithms are known [8,7,15,1]. For HyperLTL, the algorithms 
have been implemented in the model checkers MCHYPER and bounded model 
checker HYPERQB [14]. HyperLTL is limited to synchronous hyperproperties. The 
A-HLTL model checking problem is known to be undecidable in general [1]. How- 
ever, decidable fragments that can express observational determinism, noninter- 
ference, and linearizability have been identified. This paper generalizes A-HLTL 
by allowing nested trajectory quantifiers and due to the general undecidability 
result focuses on terminating programs. 

FOL[E] [6] can express a limited form of asynchronous hyperproperties. As
shown in [6], FOL[E] is subsumed by HyperLTL with additional quantification
over predicates. For S1S[E] and $H_\mu$, the model checking problem is in general
undecidable; for $H_\mu$, two fragments, the k-synchronous and k-context-bounded
fragments, have been identified for which model checking remains decidable [11].
Other logical extensions of HyperLTL with asynchronous capabilities are studied 
in [3], including their decidable fragments, but their model checking problems 
have not been implemented and the relative expressive power with respect to 
other asynchronous formalisms has not been studied. 


2 Extended Asynchronous HyperLTL 


Preliminaries. Given a natural number $k \in \mathbb{N}_0$, we use $[k]$ for the set $\{0, \ldots, k\}$.
Let AP be a set of atomic propositions and $\Sigma = 2^{AP}$ be the alphabet, where we
call each element of $\Sigma$ a letter. A trace is an infinite sequence $\sigma = a_0 a_1 \cdots$ of
letters from $\Sigma$. We denote the set of all infinite traces by $\Sigma^\omega$. We use $\sigma(i)$ for $a_i$
and $\sigma^i$ for the suffix $a_i a_{i+1} \cdots$. A pointed trace is a pair $(\sigma, p)$, where $p \in \mathbb{N}_0$ is
a natural number (called the pointer). Pointed traces allow us to traverse a trace
by moving the pointer. Given a pointed trace $(\sigma, p)$ and $n > 0$, we use $(\sigma, p) + n$
to denote the resulting trace $(\sigma, p + n)$. We denote the set of all pointed traces
by $PTR = \{(\sigma, p) \mid \sigma \in \Sigma^\omega \text{ and } p \in \mathbb{N}_0\}$.

A Kripke structure is a tuple $\mathcal{K} = (S, s_{init}, \delta, L)$, where $S$ is a set of states,
$s_{init} \in S$ is the initial state, $\delta \subseteq S \times S$ is a transition relation, and $L : S \rightarrow \Sigma$
is a labeling function on the states of $\mathcal{K}$. We require that for each $s \in S$, there
exists $s' \in S$, such that $(s, s') \in \delta$.

A path of a Kripke structure $\mathcal{K}$ is an infinite sequence of states $s(0)s(1)\cdots \in S^\omega$,
such that $s(0) = s_{init}$ and $(s(i), s(i+1)) \in \delta$, for all $i \geq 0$. A trace of $\mathcal{K}$ is a
sequence $\sigma(0)\sigma(1)\sigma(2)\cdots \in \Sigma^\omega$, such that there exists a path $s(0)s(1)\cdots \in S^\omega$
with $\sigma(i) = L(s(i))$ for all $i \geq 0$. We denote by $Traces(\mathcal{K}, s)$ the set of all traces
of $\mathcal{K}$ with paths that start in state $s \in S$.

The directed graph $\mathcal{F} = (S, \delta)$ is called the Kripke frame of the Kripke
structure $\mathcal{K}$. A loop in $\mathcal{F}$ is a finite sequence $s_0 s_1 \cdots s_n$, such that $(s_i, s_{i+1}) \in \delta$,
for all $0 \leq i < n$, and $(s_n, s_0) \in \delta$. We call a Kripke frame acyclic, if the only
loops are self-loops on terminal states, i.e., on states that have no other outgoing
transition. Acyclic Kripke structures model terminating programs.
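The acyclicity condition on Kripke frames can be checked directly on an explicit graph. The following Python sketch is an illustration only; the explicit-state representation (a set of states and a set of edge pairs) and the function name `is_acyclic_frame` are assumptions made for this example.

```python
def is_acyclic_frame(states, delta):
    """True iff the only loops of the frame (states, delta) are self-loops
    on terminal states, i.e. states with no other outgoing transition."""
    succ = {s: set() for s in states}
    for s, t in delta:
        succ[s].add(t)

    # A self-loop is permitted only if it is the sole outgoing transition.
    for s in states:
        if s in succ[s]:
            if succ[s] != {s}:
                return False
            succ[s] = set()  # ignore the permitted self-loop from here on

    # The remaining graph must be a DAG: three-colour depth-first search.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {s: WHITE for s in states}

    def has_cycle(s):
        colour[s] = GREY
        for t in succ[s]:
            if colour[t] == GREY or (colour[t] == WHITE and has_cycle(t)):
                return True
        colour[s] = BLACK
        return False

    return not any(colour[s] == WHITE and has_cycle(s) for s in states)
```

For instance, the chain 0 → 1 → 2 with a self-loop on 2 is acyclic in this sense, while a frame with a self-loop on a state that also has another successor is not.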
Extended A-HLTL. The syntax of extended A-HLTL is:

$\varphi ::= \exists\pi.\varphi \mid \forall\pi.\varphi \mid \mathsf{E}\tau.\varphi \mid \mathsf{A}\tau.\varphi \mid \psi$
$\psi ::= true \mid a_{\pi,\tau} \mid \neg\psi \mid \psi_1 \vee \psi_2 \mid \psi_1 \wedge \psi_2 \mid \psi_1\ \mathcal{U}\ \psi_2 \mid \psi_1\ \mathcal{R}\ \psi_2$

where $a \in AP$, $\pi$ is a trace variable from an infinite supply $\mathcal{V}$ of trace variables,
$\tau$ is a trajectory variable from an infinite supply $\mathcal{J}$ of trajectory variables (see
formula $\varphi_{NI_{nd}}$ in Section 1 for an example). The intended meaning of $a_{\pi,\tau}$ is
that proposition $a \in AP$ holds in the current time in trace $\pi$ and trajectory $\tau$
(explained later). Trace (respectively, trajectory) quantifiers $\exists\pi$ and $\forall\pi$
(respectively, $\mathsf{E}\tau$ and $\mathsf{A}\tau$) allow reasoning simultaneously about different traces
(respectively, trajectories). The intended meaning of $\mathsf{E}$ is that there is a trajectory that
gives an interpretation of the relative passage of time between the traces for
which the temporal formula that relates the traces is satisfied. Dually, $\mathsf{A}$ means
that all trajectories satisfy the inner formula. Given an A-HLTL formula $\varphi$, we
use $Paths(\varphi)$ (respectively, $Trajs(\varphi)$) for the set of trace (respectively, trajectory)
variables quantified in $\varphi$. A formula $\varphi$ is well-formed if for all atoms $a_{\pi,\tau}$ in $\varphi$,
$\pi$ and $\tau$ are quantified in $\varphi$ (i.e., $\tau \in Trajs(\varphi)$ and $\pi \in Paths(\varphi)$) and no
trajectory/trace variable is quantified twice in $\varphi$. We use the usual syntactic sugar
$false \triangleq \neg true$, $\Diamond\varphi \triangleq true\ \mathcal{U}\ \varphi$, $\varphi_1 \rightarrow \varphi_2 \triangleq \neg\varphi_1 \vee \varphi_2$, and $\Box\varphi \triangleq \neg\Diamond\neg\varphi$, etc.
We choose to add $\mathcal{R}$ (release) and $\wedge$ to the logic to enable negation normal form
(NNF). As our BMC algorithm cannot handle formulas that are not invariant
under stuttering, the next operator is not included.
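The well-formedness condition can be illustrated with a small syntax-tree walk. The sketch below is an assumption-laden illustration: formulas are encoded as nested Python tuples with hypothetical tags ("forall", "exists", "E", "A", "atom", "not", "or", "and", "U", "R", "true"), and the check follows the definition literally, i.e. it tests that the variables are quantified somewhere in the formula and that none is quantified twice.

```python
def well_formed(phi):
    """Check that every atom a_{pi,tau} uses quantified variables and that
    no trace or trajectory variable is quantified twice."""
    paths, trajs, atoms = [], [], []

    def walk(f):
        tag = f[0]
        if tag in ("exists", "forall"):      # trace quantifiers
            paths.append(f[1])
            walk(f[2])
        elif tag in ("E", "A"):              # trajectory quantifiers
            trajs.append(f[1])
            walk(f[2])
        elif tag == "atom":                  # ("atom", a, pi, tau)
            atoms.append((f[2], f[3]))
        else:                                # true, not, or, and, U, R
            for sub in f[1:]:
                walk(sub)

    walk(phi)
    no_duplicates = len(set(paths)) == len(paths) and len(set(trajs)) == len(trajs)
    all_bound = all(p in paths and t in trajs for p, t in atoms)
    return no_duplicates and all_bound


# Example: forall pi. exists pi2. E tau. (a_{pi,tau} U b_{pi2,tau})
example = ("forall", "pi", ("exists", "pi2", ("E", "tau",
           ("U", ("atom", "a", "pi", "tau"), ("atom", "b", "pi2", "tau")))))
```

Here `well_formed(example)` holds, whereas replacing the trajectory of one atom by an unquantified variable makes the check fail.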


Semantics. A trajectory $t : t(0)t(1)t(2)\cdots$ for a formula $\varphi$ is an infinite sequence
of subsets of $Paths(\varphi)$, i.e., each $t(i) \subseteq Paths(\varphi)$, for all $i \geq 0$. Essentially, in each
step of the trajectory one or more of the traces make progress or all may stutter.
A trajectory is fair for a trace variable $\pi \in Paths(\varphi)$ if there are infinitely many
positions $j$ such that $\pi \in t(j)$. A trajectory is fair if it is fair for all trace
variables in $Paths(\varphi)$. Given a trajectory $t$, by $t^i$, we mean the suffix $t(i)t(i+1)\cdots$.
Furthermore, for a set of trace variables $V$, we use $TRJ_V$ for the set
of all fair trajectories for indices from $V$. We also use a trajectory assignment
$\Gamma : Trajs(\varphi) \rightharpoonup TRJ_{Dom(\Gamma)}$, where $Dom(\Gamma)$ is the subset of $Trajs(\varphi)$ for which
$\Gamma$ is defined. Given a trajectory assignment $\Gamma$, a trajectory variable $\tau$, and a
trajectory $t$, we denote by $\Gamma[\tau \mapsto t]$ the assignment that coincides with $\Gamma$ for
every trajectory variable except for $\tau$, which is mapped to $t$.

Fig. 4: Kripke structure $\mathcal{K}$ and traces $t_1$ and $t_2$ of $\mathcal{K}$; $\mathcal{K} \not\models \varphi_{NI}$, but $\mathcal{K} \models \varphi_{NI_{nd}}$.

For the semantics of extended A-HLTL, we need asynchronous trace assignments
$\Pi : Paths(\varphi) \times Trajs(\varphi) \rightarrow T \times \mathbb{N}$ which map each pair $(\pi, \tau)$ formed by a
path variable and trajectory variable into a pointed trace. Given $(\Pi, \Gamma)$ where
$\Pi$ is an asynchronous trace assignment and $\Gamma$ a trajectory assignment, we use
$(\Pi, \Gamma) + 1$ for the successor of $(\Pi, \Gamma)$ defined as $(\Pi', \Gamma')$ where $\Gamma'(\tau) = \Gamma(\tau)^1$,
and $\Pi'(\pi, \tau) = \Pi(\pi, \tau) + 1$ if $\pi \in \Gamma(\tau)(0)$ and $\Pi'(\pi, \tau) = \Pi(\pi, \tau)$ otherwise.
Note that $\Pi$ can assign the same $\pi$ to different pointed traces depending on the
trajectory. We use $(\Pi, \Gamma) + k$ as the $k$-th successor of $(\Pi, \Gamma)$. Given an
asynchronous trace assignment $\Pi$, a trace variable $\pi$, a trajectory variable $\tau$, a trace
$\sigma$, and a pointer $p$, we denote by $\Pi[(\pi, \tau) \mapsto (\sigma, p)]$ the assignment that coincides
with $\Pi$ for every pair except for $(\pi, \tau)$, which is mapped to $(\sigma, p)$. The satisfaction
of an A-HLTL formula $\varphi$ over a trace assignment $\Pi$, a trajectory assignment
$\Gamma$, and a set of traces $T$ is defined as follows (we omit $\neg$, $\wedge$ and $\vee$ which are
standard):


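$(\Pi, \Gamma) \models_T \exists\pi.\varphi$  iff  for some $\sigma \in T$: $(\Pi[(\pi, \tau) \mapsto (\sigma, 0)], \Gamma) \models_T \varphi$ for all $\tau$
$(\Pi, \Gamma) \models_T \forall\pi.\varphi$  iff  for all $\sigma \in T$: $(\Pi[(\pi, \tau) \mapsto (\sigma, 0)], \Gamma) \models_T \varphi$ for all $\tau$
$(\Pi, \Gamma) \models_T \mathsf{E}\tau.\psi$  iff  for some $t \in TRJ_{Dom(\Pi)}$: $(\Pi, \Gamma[\tau \mapsto t]) \models_T \psi$
$(\Pi, \Gamma) \models_T \mathsf{A}\tau.\psi$  iff  for all $t \in TRJ_{Dom(\Pi)}$: $(\Pi, \Gamma[\tau \mapsto t]) \models_T \psi$
$(\Pi, \Gamma) \models_T a_{\pi,\tau}$  iff  $a \in \sigma(n)$ where $(\sigma, n) = \Pi(\pi, \tau)$
$(\Pi, \Gamma) \models_T \psi_1\ \mathcal{U}\ \psi_2$  iff  for some $i \geq 0$: $(\Pi, \Gamma) + i \models_T \psi_2$ and
                                     for all $j < i$: $(\Pi, \Gamma) + j \models_T \psi_1$
$(\Pi, \Gamma) \models_T \psi_1\ \mathcal{R}\ \psi_2$  iff  for all $i \geq 0$: $(\Pi, \Gamma) + i \models_T \psi_2$, or
                                     for some $i \geq 0$: $(\Pi, \Gamma) + i \models_T \psi_1$ and
                                     for all $j \leq i$: $(\Pi, \Gamma) + j \models_T \psi_2$

We say that a set $T$ of traces satisfies a sentence $\varphi$, denoted by $T \models \varphi$, if
$(\Pi_\emptyset, \Gamma_\emptyset) \models_T \varphi$. We say that a Kripke structure $\mathcal{K}$ satisfies an A-HLTL formula $\varphi$
(and write $\mathcal{K} \models \varphi$) if and only if we have $Traces(\mathcal{K}, s_{init}) \models \varphi$. An example is
illustrated in Fig. 4.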
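The successor operation on assignments and the role of a trajectory as an alignment can be illustrated on finite data. The following Python sketch is a simplification under stated assumptions: a single trajectory is considered, it is given as a finite prefix (a list of sets of trace variables), traces are finite strings of observations whose last letter is assumed to repeat forever (a terminating run), and the helper names are hypothetical. It checks an invariant of the form "the observations of all traces agree at every visited position".

```python
def advance(pointers, step):
    """One successor step: every trace variable selected in `step` moves
    its pointer by one, all others stutter."""
    return {pi: (p + 1 if pi in step else p) for pi, p in pointers.items()}


def letter(trace, p):
    """Letter at position p; the last letter repeats (assumed termination)."""
    return trace[min(p, len(trace) - 1)]


def always_equal(traces, trajectory):
    """Do all traces carry the same observation at every position visited
    along the given finite trajectory prefix?"""
    pointers = {pi: 0 for pi in traces}
    # The extra empty step makes sure the final position is inspected too.
    for step in list(trajectory) + [set()]:
        if len({letter(traces[pi], p) for pi, p in pointers.items()}) > 1:
            return False
        pointers = advance(pointers, step)
    return True


traces = {"pi": "aab", "pi2": "abb"}
asynchronous = [{"pi"}, {"pi", "pi2"}, {"pi2"}]   # pi moves first, pi2 last
synchronous = [{"pi", "pi2"}, {"pi", "pi2"}]      # lock-step progress
```

With these inputs, the asynchronous trajectory aligns the two traces, whereas the lock-step trajectory does not; this is the kind of witness an existential trajectory quantifier asks for.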


3 Bounded Model Checking for A-HLTL 


We first introduce the bounded semantics of A-HLTL (for at most one trajec- 
tory quantifier alternation but arbitrary trace quantifiers) which will be used to 
generate queries to a QBF solver to aid solving the BMC problem. The main 
result of this section is Theorem 1 which provides decision procedures for model 
checking A-HLTL for terminating systems. 


3.1 Bounded Semantics of A-HLTL 


The bounded semantics corresponds to the exploration of the system up to a 
certain bound. In our case, we will consider two bounds k and m (with k < m). 
The bound $k$ corresponds to the maximum depth of the unrolling of the Kripke
structures and $m$ is the bound on the length of trajectories. We start by introducing
some auxiliary functions and predicates, for a given pair $(\Pi, \Gamma)$ of trace assignment
and trajectory assignment. First, the family of functions $pos_{\pi,\tau} : \{0 \ldots m\} \rightarrow \mathbb{N}$.
The value $pos_{\pi,\tau}(i)$ is the number of times $\pi$ has been selected in $\{\tau(0), \ldots, \tau(i)\}$.
We assume that Kripke structures are equipped with an atomic proposition halt (one per trace
variable $\pi$) which encodes whether the state is a halting state. Given $(\Pi, \Gamma)$ we
consider the predicate halted that holds whenever for all $\pi$ and $\tau$, $halt \in \sigma(j)$ for
$(\sigma, j) = \Pi(\pi, \tau)$. In this case we write $(\Pi, \Gamma, n) \models halted$.
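Both auxiliary notions are easy to state operationally. The Python sketch below is illustrative only and rests on assumptions: a trajectory prefix is a list of sets of trace variables, a trace is a list of sets of propositions, and the proposition name "halt" and the function names are chosen for this example.

```python
def pos(trajectory, pi, i):
    """Number of times trace variable pi is selected in
    trajectory[0], ..., trajectory[i] (inclusive, as in the text)."""
    return sum(1 for step in trajectory[: i + 1] if pi in step)


def halted(assignment):
    """assignment maps pairs (pi, tau) to pointed traces (sigma, j);
    halted holds when every pointed trace sits on a halting letter."""
    return all("halt" in sigma[j] for sigma, j in assignment.values())


trajectory = [{"pi"}, {"pi"}, {"pi"}, {"pi2"}, {"pi2"}, {"pi2"}]
sigma = [set(), {"a"}, {"a", "halt"}]
```

For this trajectory, `pos(trajectory, "pi", 2)` evaluates to 3 and `pos(trajectory, "pi2", 2)` to 0.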

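We define two bounded semantics which only differ in how they inspect beyond the $(k, m)$
bounds: $\models^{hpes}_{k,m}$, called the halting pessimistic semantics, and $\models^{hopt}_{k,m}$,
called the halting optimistic semantics. We start by defining the bounded
semantics of the quantifiers.

$(\Pi, \Gamma, 0) \models_{k,m} \exists\pi.\psi$  iff  there is a $\sigma \in T_\pi$, such that for all $\tau$:
                                   $(\Pi[(\pi, \tau) \mapsto (\sigma, 0)], \Gamma, 0) \models_{k,m} \psi$    (1)
$(\Pi, \Gamma, 0) \models_{k,m} \forall\pi.\psi$  iff  for all $\sigma \in T_\pi$, for all $\tau$:
                                   $(\Pi[(\pi, \tau) \mapsto (\sigma, 0)], \Gamma, 0) \models_{k,m} \psi$    (2)
$(\Pi, \Gamma, 0) \models_{k,m} \mathsf{E}\tau.\psi$  iff  there is a $t \in TRJ_{Dom(\Pi)}$:
                                   $(\Pi, \Gamma[\tau \mapsto t], 0) \models_{k,m} \psi$    (3)
$(\Pi, \Gamma, 0) \models_{k,m} \mathsf{A}\tau.\psi$  iff  for all $t \in TRJ_{Dom(\Pi)}$:
                                   $(\Pi, \Gamma[\tau \mapsto t], 0) \models_{k,m} \psi$    (4)

For the Boolean operators, for $i \leq m$:

$(\Pi, \Gamma, i) \models_{k,m} true$    (5)
$(\Pi, \Gamma, i) \models_{k,m} a_{\pi,\tau}$  iff  $a \in \sigma(j)$ where $(\sigma, j) = \Pi(\pi, \tau)(i)$ and $j \leq k$    (6)
$(\Pi, \Gamma, i) \models_{k,m} \neg a_{\pi,\tau}$  iff  $a \notin \sigma(j)$ where $(\sigma, j) = \Pi(\pi, \tau)(i)$ and $j \leq k$    (7)
$(\Pi, \Gamma, i) \models_{k,m} \psi_1 \vee \psi_2$  iff  $(\Pi, \Gamma, i) \models_{k,m} \psi_1$ or $(\Pi, \Gamma, i) \models_{k,m} \psi_2$    (8)
$(\Pi, \Gamma, i) \models_{k,m} \psi_1 \wedge \psi_2$  iff  $(\Pi, \Gamma, i) \models_{k,m} \psi_1$ and $(\Pi, \Gamma, i) \models_{k,m} \psi_2$    (9)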


For the temporal operators, we must consider the cases of falling off the
paths (beyond $k$) and falling off the traces (beyond $m$). We define the predicate
off which holds for $(\Pi, \Gamma, i)$ if for some $(\pi, \tau)$, $pos_{\pi,\tau}(i) > k$ and $halt_\pi \notin \sigma(k)$
where $\sigma$ is the trace assigned to $\pi$. Note that halted implies that off does not
hold because all paths (including those at $k$ or beyond) satisfy halt.

We define two semantics that differ on how to interpret when the end of
the unfolding of the traces and trajectories is reached. The halting pessimistic
semantics, denoted by $\models^{hpes}_{k,m}$, take (1)-(9) above and add (10)-(13) together with
$(\Pi, \Gamma, i) \not\models_{k,m} off$. Rules (10) and (11) define the semantics of the temporal
operators for the case $i < m$, that is, before the end of the unrolling of the
trajectories (recall that we do not consider the next operator):

$(\Pi, \Gamma, i) \models_{k,m} \psi_1\ \mathcal{U}\ \psi_2$  iff  $(\Pi, \Gamma, i) \models_{k,m} \psi_2$, or ($(\Pi, \Gamma, i) \models_{k,m} \psi_1$ and
                                       $(\Pi, \Gamma, i) + 1 \models_{k,m} \psi_1\ \mathcal{U}\ \psi_2$)    (10)
$(\Pi, \Gamma, i) \models_{k,m} \psi_1\ \mathcal{R}\ \psi_2$  iff  $(\Pi, \Gamma, i) \models_{k,m} \psi_2$, and ($(\Pi, \Gamma, i) \models_{k,m} \psi_1$, or
                                       $(\Pi, \Gamma, i) + 1 \models_{k,m} \psi_1\ \mathcal{R}\ \psi_2$)    (11)

For the case of $i = m$, that is, at the bound of the trajectory:

$(\Pi, \Gamma, m) \models^{hpes}_{k,m} \psi_1\ \mathcal{U}\ \psi_2$  iff  $(\Pi, \Gamma, m) \models_{k,m} \psi_2$    (12)
$(\Pi, \Gamma, m) \models^{hpes}_{k,m} \psi_1\ \mathcal{R}\ \psi_2$  iff  $(\Pi, \Gamma, m) \models_{k,m} \psi_1 \wedge \psi_2$, or
                                              $(\Pi, \Gamma, m) \models_{k,m} halted \wedge \psi_2$    (13)

The halting optimistic semantics, denoted by $\models^{hopt}_{k,m}$, take rules (1)-(11) and
(12')-(13'), but now if $(\Pi, \Gamma, i) \models^{hopt}_{k,m} off$ then $(\Pi, \Gamma, i) \models^{hopt}_{k,m} \varphi$ holds for
every formula. Again, rules (10) and (11) define the semantics of the temporal
operators for the case $i < m$. Then, for $i = m$:

$(\Pi, \Gamma, m) \models^{hopt}_{k,m} \psi_1\ \mathcal{U}\ \psi_2$  iff  $(\Pi, \Gamma, m) \models_{k,m} \psi_2$, or
                                              $(\Pi, \Gamma, m) \not\models_{k,m} halted$ and $(\Pi, \Gamma, m) \models_{k,m} \psi_1$    (12')
$(\Pi, \Gamma, m) \models^{hopt}_{k,m} \psi_1\ \mathcal{R}\ \psi_2$  iff  $(\Pi, \Gamma, m) \models_{k,m} \psi_2$    (13')


Similar to [15] for the case of HyperLTL, the pessimistic semantics capture 
the case where we assume that pending eventualities will not become true in 
the future after the end of the trace (this is also assumed in LTL BMC). Dually, 
the optimistic semantics assume that all pending eventualities at the end of the 
trace will be fulfilled. Therefore, the following hold (proofs in [13]). 


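Lemma 1. Let $k \leq k'$ and $m \leq m'$.
1. If $(\Pi, \Gamma, 0) \models^{hpes}_{k,m} \varphi$, then $(\Pi, \Gamma, 0) \models^{hpes}_{k',m'} \varphi$.
2. If $(\Pi, \Gamma, 0) \not\models^{hopt}_{k,m} \varphi$, then $(\Pi, \Gamma, 0) \not\models^{hopt}_{k',m'} \varphi$.


Lemma 2. The following hold for every $k$ and $m$:
1. If $(\Pi, \Gamma, 0) \models^{hpes}_{k,m} \varphi$, then $(\Pi, \Gamma, 0) \models \varphi$.
2. If $(\Pi, \Gamma, 0) \not\models^{hopt}_{k,m} \varphi$, then $(\Pi, \Gamma, 0) \not\models \varphi$.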


3.2 From Bounded Semantics to QBF Solving 


Let $\mathcal{K}$ be a Kripke structure and $\varphi$ be an A-HLTL formula. Based on the bounded
semantics introduced previously, our main approach is to generate a QBF query
(with bounds $k$, $m$), which can use either the pessimistic or the optimistic
semantics. We use $[\![\mathcal{K}, \varphi]\!]^{hpes}_{k,m}$ if the pessimistic semantics are used and
$[\![\mathcal{K}, \varphi]\!]^{hopt}_{k,m}$ if the optimistic semantics are used. Our translations will satisfy that

(1) if $[\![\mathcal{K}, \varphi]\!]^{hpes}_{k,m}$ is SAT, then $\mathcal{K} \models \varphi$;

(2) if $[\![\mathcal{K}, \varphi]\!]^{hopt}_{k,m}$ is UNSAT, then $\mathcal{K} \not\models \varphi$;

(3) if the Kripke structure is unrolled to the diameter and the trajectories up
to a maximum length (see below), then $[\![\mathcal{K}, \varphi]\!]^{hpes}_{k,m}$ is SAT if and only if
$[\![\mathcal{K}, \varphi]\!]^{hopt}_{k,m}$ is SAT.
The first step to define $[\![\mathcal{K}, \varphi]\!]^{hpes}_{k,m}$ and $[\![\mathcal{K}, \varphi]\!]^{hopt}_{k,m}$ is to encode the unrolling of
the models up to a given depth $k$. For a path variable $\pi$ corresponding to Kripke
structure $\mathcal{K}$, we introduce $(k + 1)$ copies $(x^0, \ldots, x^k)$ of the Boolean variables
that define the state of $\mathcal{K}$ and use the initial condition $I$ and the transition
relation $R$ of $\mathcal{K}$ to relate these variables. For example, for $k = 3$, we unroll the
transition relation up to 3 as follows:

$[\![\mathcal{K}]\!]_3 = I(x^0) \wedge R(x^0, x^1) \wedge R(x^1, x^2) \wedge R(x^2, x^3)$.
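The unrolling can be sketched in two ways: symbolically, as the conjunction above, and explicitly, as the set of paths of length k that satisfy it. The following Python sketch is illustrative; the textual formula syntax and the explicit-state representation of the transition relation are assumptions made for this example, not the encoding used by the tool.

```python
def unrolling_formula(k):
    """Textual form of the unrolling: I(x^0) and R(x^i, x^{i+1}) for i < k."""
    return " ∧ ".join(["I(x^0)"] + [f"R(x^{i}, x^{i + 1})" for i in range(k)])


def unrolled_paths(init, delta, k):
    """All state sequences s_0 ... s_k with s_0 = init and (s_i, s_{i+1})
    in delta; these correspond to the models of the unrolled formula."""
    paths = [(init,)]
    for _ in range(k):
        paths = [p + (t,) for p in paths for (s, t) in sorted(delta) if s == p[-1]]
    return paths


delta = {(0, 1), (0, 2), (1, 3), (2, 3), (3, 3)}
print(unrolling_formula(3))
print(unrolled_paths(0, delta, 3))
```

For the small diamond-shaped structure in `delta`, the unrolling to depth 3 has exactly two models, one per branch.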


Encoding positions. For each trajectory variable $\tau$ and given the bound $m$ on the
unrolling of trajectories, we add $|Paths(\varphi)| \times (m + 1)$ variables $t^0_\pi \ldots t^m_\pi$, for each $\pi$.
The intended meaning of $t^j_\pi$ is that $t^j_\pi$ is true whenever $\pi \in t(j)$, that is, when $t$
dictates that $\pi$ moves at time instant $j$. In order to encode sanity conditions on
trajectories, which are crucial for completeness, it is necessary to introduce a family of
variables that captures how much $\pi$ has moved according to $\tau$ after $j$ steps. There is a
variable pos for each trace variable $\pi$, each trajectory $\tau$ and each $i \leq k$ and $j \leq m$.
We represent this variable by $pos^{i,j}_{\pi,\tau}$. The intention is that $pos^{i,j}_{\pi,\tau}$ is true
whenever after $j$ steps trajectory $\tau$ has dictated that trace $\pi$ progresses precisely $i$
times. Fig. 5 shows encodings $t^j_\pi$ and $pos^{i,j}_{\pi,\tau}$ for the traces w.r.t. the blue
trajectory, $\tau'$, in Fig. 4.

Fig. 5: Variables for encodings of the blue trajectory in Fig. 4, where green variables are
true and gray variables are false.

We will use the auxiliary definitions (for $i \in \{0 \ldots k\}$ and $j \in \{0 \ldots m\}$) to force
that the path $\pi$ has moved to position $i$ after $j$ moves from the trajectory and that $\pi$ has
not fallen off the trace (and does not change position when the paths fall off the trace):

$setpos^{i,j}_{\pi,\tau} \overset{def}{=} pos^{i,j}_{\pi,\tau} \wedge \bigwedge_{n \in \{0..k\} \setminus \{i\}} \neg pos^{n,j}_{\pi,\tau} \wedge \neg off^{j}_{\pi,\tau}$

$nopos^{j}_{\pi,\tau} \overset{def}{=} off^{j}_{\pi,\tau} \wedge \bigwedge_{n \in \{0..k\}} \neg pos^{n,j}_{\pi,\tau}$

Initially, $I_{pos} \overset{def}{=} \bigwedge_{\pi,\tau} setpos^{0,0}_{\pi,\tau}$, where $\pi \in Traces(\varphi)$ and $\tau \in TRJ_{Dom(\Pi)}$.
$I_{pos}$ captures that all paths are initially at position 0. Then, for every step
$j \in \{0 \ldots m\}$, the following formulas relate the values of pos and off, depending
on whether trajectory $\tau$ moves path $\pi$ or not (and on whether $\pi$ has reached the
end $k$ or halted):

$step^{j}_{\pi,\tau} \overset{def}{=} \bigwedge_{i \in \{0..k-1\}} \big(pos^{i,j}_{\pi,\tau} \wedge t^{j}_{\pi} \rightarrow setpos^{i+1,j+1}_{\pi,\tau}\big)$

$stutters^{j}_{\pi,\tau} \overset{def}{=} \bigwedge_{i \in \{0..k\}} \big(pos^{i,j}_{\pi,\tau} \wedge \neg t^{j}_{\pi} \rightarrow setpos^{i,j+1}_{\pi,\tau}\big)$

$ends^{j}_{\pi,\tau} \overset{def}{=} \big(pos^{k,j}_{\pi,\tau} \wedge t^{j}_{\pi}\big) \rightarrow \big((\neg halt^{k}_{\pi} \rightarrow nopos^{j+1}_{\pi,\tau}) \wedge (halt^{k}_{\pi} \rightarrow setpos^{k,j+1}_{\pi,\tau})\big)$

Then the following formula captures the correct assignment to the pos variables,
including the initial assignment:

$\varphi_{pos} \overset{def}{=} I_{pos} \wedge \bigwedge_{j \in \{0..m\}} \bigwedge_{\pi,\tau} \big(step^{j}_{\pi,\tau} \wedge stutters^{j}_{\pi,\tau} \wedge ends^{j}_{\pi,\tau}\big)$

For example, Fig. 5 (w.r.t. Fig. 4) encodes the blue trajectory ($\tau'$) of $\pi$
(i.e., $t_1$) and $\pi'$ (i.e., $t_2$) as follows. First, for $j \in [0, 3)$, it advances $t_1$
and stutters $t_2$. Therefore, $t^0_\pi, t^1_\pi, t^2_\pi$ are true and $t^0_{\pi'}, t^1_{\pi'}, t^2_{\pi'}$ are false.
Notice that for $pos_{\pi,\tau'}$ the $\pi$ position advances according to $step^{j}_{\pi,\tau'}$ (i.e.,
$pos^{0,0}_{\pi,\tau'}, pos^{1,1}_{\pi,\tau'}, pos^{2,2}_{\pi,\tau'}, pos^{3,3}_{\pi,\tau'}$), while $\pi'$ stutters according to
$stutters^{j}_{\pi',\tau'}$ (i.e., $pos^{0,0}_{\pi',\tau'}, pos^{0,1}_{\pi',\tau'}, pos^{0,2}_{\pi',\tau'}, pos^{0,3}_{\pi',\tau'}$). Then, for $j \in [3, 5]$, it
alternatively advances $t_2$, which makes $t^3_\pi, t^4_\pi, t^5_\pi$ false and $t^3_{\pi'}, t^4_{\pi'}, t^5_{\pi'}$ true.
Similarly, the movements become $pos^{3,4}_{\pi,\tau'}, pos^{3,5}_{\pi,\tau'}, pos^{3,6}_{\pi,\tau'}$ and
$pos^{1,4}_{\pi',\tau'}, pos^{2,5}_{\pi',\tau'}, pos^{3,6}_{\pi',\tau'}$. At the halting point
(i.e., $j = k$), both trajectories trigger $ends^{j}$ and do not advance anymore.
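The position bookkeeping for a single path under a single trajectory can be simulated directly. The Python sketch below is an illustration under assumptions: `moves[j]` stands for the truth value of the move variable of the path at step j, the returned list gives, for each j, the unique position i whose pos variable would be true (or None once the path has fallen off), the flag `halts_at_end` models whether the state at depth k is a halting state, and a path that has fallen off is assumed to stay off.

```python
def position_table(moves, k, halts_at_end=True):
    """Simulate the step / stutters / ends rules for one path."""
    position = 0                  # initially position 0 at step 0
    table = [position]
    for t in moves:
        if position is None:
            pass                  # already fallen off: stays off (assumption)
        elif not t:
            pass                  # stutters: position is unchanged
        elif position < k:
            position += 1         # step: advance by one
        elif not halts_at_end:
            position = None       # ends: moved past k without halting
        # ends with a halting state at k: position stays at k
        table.append(position)
    return table


# The example from the text with k = 3: pi moves first, pi' afterwards.
pi_moves = [True, True, True, False, False, False]
pi2_moves = [False, False, False, True, True, True]
print(position_table(pi_moves, 3))    # positions of pi at j = 0..6
print(position_table(pi2_moves, 3))   # positions of pi' at j = 0..6
```

Under these assumptions the two tables read 0, 1, 2, 3, 3, 3, 3 and 0, 0, 0, 0, 1, 2, 3, which agrees with the sequences of pos variables listed in the example.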


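Encoding the inner LTL formula. We will use the following auxiliary predicates:

$halted^{j} \overset{def}{=} \bigwedge_{\pi,\tau} halted^{j}_{\pi,\tau}$        $off^{j} \overset{def}{=} \bigvee_{\pi,\tau} off^{j}_{\pi,\tau}$

We now give the encoding for the inner temporal formulas for a fixed unrolling $k$
and $m$ as follows. For the atomic and Boolean formulas, the following translations
are performed for $j \in \{0 \ldots m\}$.

$[\![p_{\pi,\tau}]\!]^{j}_{k,m} := \bigvee_{i \in \{0..k\}} (pos^{i,j}_{\pi,\tau} \wedge p^{i}_{\pi})$    (14)
$[\![\neg p_{\pi,\tau}]\!]^{j}_{k,m} := \bigvee_{i \in \{0..k\}} (pos^{i,j}_{\pi,\tau} \wedge \neg p^{i}_{\pi})$    (15)
$[\![\psi_1 \vee \psi_2]\!]^{j}_{k,m} := [\![\psi_1]\!]^{j}_{k,m} \vee [\![\psi_2]\!]^{j}_{k,m}$    (16)
$[\![\psi_1 \wedge \psi_2]\!]^{j}_{k,m} := [\![\psi_1]\!]^{j}_{k,m} \wedge [\![\psi_2]\!]^{j}_{k,m}$    (17)

The halting pessimistic semantics translation uses $[\![\cdot]\!]^{hpes}$, taking (14)-(17)
and (18)-(21) below. For the temporal operators and $j < m$:

$[\![\psi_1\ \mathcal{U}\ \psi_2]\!]^{j}_{k,m} := \neg off^{j} \wedge \big([\![\psi_2]\!]^{j}_{k,m} \vee ([\![\psi_1]\!]^{j}_{k,m} \wedge [\![\psi_1\ \mathcal{U}\ \psi_2]\!]^{j+1}_{k,m})\big)$    (18)
$[\![\psi_1\ \mathcal{R}\ \psi_2]\!]^{j}_{k,m} := \neg off^{j} \wedge \big([\![\psi_2]\!]^{j}_{k,m} \wedge ([\![\psi_1]\!]^{j}_{k,m} \vee [\![\psi_1\ \mathcal{R}\ \psi_2]\!]^{j+1}_{k,m})\big)$    (19)

For $j = m$:

$[\![\psi_1\ \mathcal{U}\ \psi_2]\!]^{m}_{k,m} := [\![\psi_2]\!]^{m}_{k,m}$    (20)
$[\![\psi_1\ \mathcal{R}\ \psi_2]\!]^{m}_{k,m} := ([\![\psi_1]\!]^{m}_{k,m} \wedge [\![\psi_2]\!]^{m}_{k,m}) \vee (halted^{m} \wedge [\![\psi_2]\!]^{m}_{k,m})$    (21)

The halting optimistic semantics translation uses $[\![\cdot]\!]^{hopt}$, taking (14)-(17)
and (18')-(21') as follows. For the temporal operators and $j < m$:

$[\![\psi_1\ \mathcal{U}\ \psi_2]\!]^{j}_{k,m} := off^{j} \vee \big([\![\psi_2]\!]^{j}_{k,m} \vee ([\![\psi_1]\!]^{j}_{k,m} \wedge [\![\psi_1\ \mathcal{U}\ \psi_2]\!]^{j+1}_{k,m})\big)$    (18')
$[\![\psi_1\ \mathcal{R}\ \psi_2]\!]^{j}_{k,m} := off^{j} \vee \big([\![\psi_2]\!]^{j}_{k,m} \wedge ([\![\psi_1]\!]^{j}_{k,m} \vee [\![\psi_1\ \mathcal{R}\ \psi_2]\!]^{j+1}_{k,m})\big)$    (19')

For $j = m$:

$[\![\psi_1\ \mathcal{U}\ \psi_2]\!]^{m}_{k,m} := [\![\psi_2]\!]^{m}_{k,m} \vee (\neg halted^{m} \wedge [\![\psi_1]\!]^{m}_{k,m})$    (20')
$[\![\psi_1\ \mathcal{R}\ \psi_2]\!]^{m}_{k,m} := [\![\psi_2]\!]^{m}_{k,m}$    (21')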
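The difference between the two translations of the until operator can be seen by evaluating the recursion on concrete truth values. The Python sketch below is a simplification under stated assumptions: the subformulas are assumed to be already evaluated at every step j ≤ m (lists `psi1`, `psi2`, `off`), the optimistic rule at the bound is read as "the pending eventuality is accepted unless all paths have halted", and the function name is hypothetical.

```python
def until_bounded(psi1, psi2, off, halted_m, optimistic):
    """Evaluate psi1 U psi2 at step 0 by unfolding backwards from the
    bound m = len(psi1) - 1, in the pessimistic or optimistic reading."""
    m = len(psi1) - 1
    if optimistic:
        value = psi2[m] or (not halted_m and psi1[m])   # rule at the bound
    else:
        value = psi2[m]
    for j in range(m - 1, -1, -1):
        core = psi2[j] or (psi1[j] and value)           # one-step unfolding
        value = (off[j] or core) if optimistic else (not off[j] and core)
    return value


psi1 = [True, True, True]      # psi1 holds at every step up to the bound
psi2 = [False, False, False]   # psi2 is never observed within the bound
off = [False, False, False]
print(until_bounded(psi1, psi2, off, halted_m=False, optimistic=False))
print(until_bounded(psi1, psi2, off, halted_m=False, optimistic=True))
```

With a pending eventuality at the bound, the pessimistic reading rejects and the optimistic reading accepts; once the system has halted, both readings reject.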


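Combining the encodings. Let $\varphi$ be an A-HLTL formula of the form
$\varphi = Q_A\pi_A \ldots Q_Z\pi_Z.\ Q_a\tau_a \ldots Q_z\tau_z.\ \psi$. Combining all the components, the
encoding of the A-HLTL BMC problem into QBF, for bounds $k$ and $m$, is:

$[\![\mathcal{K}, \varphi]\!]_{k,m} = Q_A\overline{x_A} \ldots Q_Z\overline{x_Z}.\ Q_a\overline{t_a} \ldots Q_z\overline{t_z}.\ \exists \overline{pos}.\ \exists \overline{off}.\ \big([\![\mathcal{K}]\!]_k \circ_A \cdots [\![\mathcal{K}]\!]_k \circ_Z (\varphi_{pos} \wedge enc(\psi))\big)$

where $\circ_A$ is $\rightarrow$ if $Q_A = \forall$ (and $\circ_A$ is $\wedge$ if $Q_A = \exists$), and $\circ_B, \ldots$ are defined
similarly. The set $\overline{pos}$ is the set of variables $pos^{i,j}_{\pi,\tau}$ that encode the positions
and $\overline{off}$ is the set of variables $off^{j}_{\pi,\tau}$ that encode when the progress of a trace has
fallen off its unrolling limit. We next define the encoding $enc(\psi)$ of the temporal
formula $\psi$.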


Encoding formulas with up to one trajectory quantifier alternation. We consider
the encoding into QBF of formulas with zero and one quantifier alternation
separately. In the following, we say that at position $j$ a collection of trajectories
$U$ “moves” whenever either all trajectories have moved all their paths to the
halting state, or at least one of the trajectories in $U$ makes one of the non-halted
paths move at position $j$. Formally,

$moves^{j}_{U} \overset{def}{=} halted^{j}_{U} \vee \bigvee_{\tau \in U,\, \pi} (t^{j}_{\pi} \wedge \neg halt^{j}_{\pi})$

- $\mathsf{E}^+U.\psi$: In this case, the formula generated for $enc(\psi)$ is

  $\big(\bigwedge_{j \in \{0 \ldots m\}} moves^{j}_{U}\big) \wedge [\![\psi]\!]^{0}_{k,m}$

  This is correct since the positions at which all trajectories stutter all paths
  can be removed (obtaining a satisfying path), so we can restrict the search to
  non-stuttering trajectory steps.

- $\mathsf{A}^+U.\psi$: In this case, the formula generated for $enc(\psi)$ is

  $\big(\bigwedge_{j \in \{0 \ldots m\}} moves^{j}_{U}\big) \rightarrow [\![\psi]\!]^{0}_{k,m}$

  The reasoning is similar to the previous case.

- $\mathsf{A}^+U_A\,\mathsf{E}^+U_E.\psi$: In this case, the formula generated for $enc(\psi)$ is

  $\big(\bigwedge_{j \in \{0 \ldots m\}} moves^{j}_{U_A}\big) \rightarrow \big(\bigwedge_{j \in \{0 \ldots m\}} (halted^{j}_{U_A} \rightarrow moves^{j}_{U_E}) \wedge [\![\psi]\!]^{0}_{k,m}\big)$

  Universally quantified trajectories must explore all trajectories, which must
  be answered by the existential trajectories. Assume there is a strategy for
  $U_E$ for the case that universal trajectories $U_A$ never stutter at any position.
  This can be extended into a strategy for the case where $U_A$ can possibly stutter,
  by adding a stuttering step to the $U_E$ trajectories at the same position.
  This guarantees the same evaluation. Therefore, we restrict our search for
  the outer $U_A$ to non-stuttering trajectories. Finally, $U_E$ is obliged to move
  after $U_A$ has halted all paths to prevent global stuttering.

- $\mathsf{E}^+U_E\,\mathsf{A}^+U_A.\psi$: In this case, the formula generated for $enc(\psi)$ is similar,

  $\big(\bigwedge_{j \in \{0 \ldots m\}} moves^{j}_{U_E}\big) \wedge \big(\bigwedge_{j \in \{0 \ldots m\}} (halted^{j}_{U_E} \rightarrow moves^{j}_{U_A}) \rightarrow [\![\psi]\!]^{0}_{k,m}\big)$

  The rationale for this encoding is the following. It is not necessary to explore
  a non-moving step $j$ for the existentially quantified trajectories $U_E$ because
  if this stuttering step is successful it must work for all possible moves of
  the $U_A$ trajectories at the same time step $j$. This includes the case that all
  trajectories in $U_A$ make all paths stutter (in which case, if we remove $j$, one still
  has all the legal trajectories for $U_A$). Since the logic does not contain the
  next operator, the evaluation for the given $U_E$ and one of the trajectories
  for $U_A$ that stutter at $j$ will be the same as for $j + 1$ for all logical formulas.
  Therefore, the trajectory that is obtained from removing step $j$ from $U_E$ is
  still a satisfying trajectory assignment. It follows that if there is a model
  for $U_E$ there is a model that does not stutter. Finally, after all paths have
  halted according to the $U_E$ trajectories, a step of $U_A$ that stutters all paths
  that have not halted can be removed because, again, the evaluation is the
  same in the previous and subsequent state. It follows that if the formula has
  a model, then it has a model satisfying the encoding.


Theorem 1. Let $\varphi$ be an A-HLTL formula with at most one trajectory quantifier
alternation, let $K$ be the maximum depth of a Kripke structure $\mathcal{K}$ and let
$M = K \times |Paths(\varphi)| \times |Trajs(\varphi)|$. Then, the following hold:

$[\![\mathcal{K}, \varphi]\!]^{hpes}_{K,M}$ is satisfiable if and only if $\mathcal{K} \models \varphi$.
$[\![\mathcal{K}, \varphi]\!]^{hopt}_{K,M}$ is satisfiable if and only if $\mathcal{K} \models \varphi$.

Theorem 1 (proof in [13]) provides a model checking decision procedure. An
alternative decision procedure is to iteratively increase the bound of the unrollings
and invoke both semantics in parallel until the outcome coincides.
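The iterative procedure can be sketched as a simple loop around two solver calls. In the Python sketch below, `solve_hpes` and `solve_hopt` are hypothetical callbacks that build the pessimistic (respectively optimistic) query for bounds (k, m) and return True when the QBF solver reports SAT; the choice of the intermediate trajectory bound m as k times the number of trace and trajectory variables is an assumption that mirrors the bound M of Theorem 1.

```python
def decide(solve_hpes, solve_hopt, depth, n_paths, n_trajs):
    """Increase the bounds until one of the two sound answers is obtained.

    A SAT answer of the pessimistic query is conclusive for satisfaction,
    an UNSAT answer of the optimistic query is conclusive for violation.
    """
    for k in range(1, depth + 1):
        m = k * n_paths * n_trajs          # trajectory bound (assumed shape)
        if solve_hpes(k, m):
            return True                    # the structure satisfies the formula
        if not solve_hopt(k, m):
            return False                   # the structure violates the formula
    # At k = depth both queries should agree, so this point is unexpected.
    raise RuntimeError("semantics disagree at the completeness bound")


# Stub solvers standing in for real QBF calls (illustration only).
print(decide(lambda k, m: k >= 3, lambda k, m: True, depth=5, n_paths=2, n_trajs=1))
print(decide(lambda k, m: False, lambda k, m: k < 2, depth=5, n_paths=2, n_trajs=1))
```

The early exits are what make the iterative variant attractive in practice: shallow bugs are reported by the optimistic query long before the completeness bound is reached.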


4 Complexity of A-HLTL Model Checking for Acyclic Frames


Our goal in this section is to analyze the complexity of the A-HLTL model checking 
problem in the size of an acyclic Kripke structure (all proofs in [13]). 
Problem Formulation. We use MC[Fragment] to distinguish different variations
of the problem, where MC is the model checking decision problem, i.e.,
whether or not $\mathcal{K} \models \varphi$, and Fragment is one of the following for $\varphi$:

- ‘$[\exists(\exists/\forall)^+\mathsf{A}/\mathsf{E}]^k$’, for $k \geq 0$, is the fragment with a lead existential trace
  quantifier, one outermost universal or existential trajectory quantifier, and
  $k$ (counting all) quantifier alternations, where $k = 0$ means the existential
  alternation-free fragment ‘$\exists^+\mathsf{E}$’. Fragment ‘$[\forall(\forall/\exists)^+\mathsf{A}/\mathsf{E}]^k$’ is defined
  similarly, where $k = 0$ is the universal alternation-free fragment ‘$\forall^+\mathsf{A}$’.

- Fragment ‘$[\exists(\exists/\forall)^+(\mathsf{E}^+\mathsf{A}^+/\mathsf{A}^+\mathsf{E}^+/\mathsf{E}\mathsf{E}^+/\mathsf{A}\mathsf{A}^+)]^k$’, for $k \geq 1$, denotes the
  fragment with a lead existential trace quantifier, multiple outermost trajectory
  quantifiers with at most one alternation, and $k$ quantifier alternations (counting
  all quantifiers), where $k = 1$ means fragment ‘$\exists\mathsf{E}\mathsf{A}$’. Fragment
  ‘$[\forall(\forall/\exists)^+(\mathsf{E}^+\mathsf{A}^+/\mathsf{A}^+\mathsf{E}^+/\mathsf{E}\mathsf{E}^+/\mathsf{A}\mathsf{A}^+)]^k$’ is defined similarly, where $k = 1$ means
  fragment ‘$\forall\mathsf{A}\mathsf{E}$’.


The Complexity of A-HLTL Model Checking. We first show that the A-HLTL
model checking problem for the alternation-free fragment with only one trajectory
quantifier is NL-complete. For example, verification of information leak in
speculative execution in sequential programs renders a formula of the form $\forall^4\mathsf{A}$,
which belongs to the alternation-free fragment (more details in Section 5).


Theorem 2. MC[$\exists^+\mathsf{E}$] and MC[$\forall^+\mathsf{A}$] are NL-complete.


We now switch to formulas with alternating trace quantifiers. The significance
of the next theorem is that a single trajectory quantifier does not change the
complexity of model checking as compared to the classic HyperLTL verification [2].
It is noteworthy that several important classes of formulas belong
to this fragment. For example, according to Theorem 3, model checking
observational determinism [20] ($\forall\forall\mathsf{E}$), generalized noninference [16] ($\forall\forall\exists\mathsf{E}$), and
non-inference [5] ($\forall\exists\mathsf{E}$) with a single initial input are all coNP-complete.


Theorem 3. MC$[\exists(\exists/\forall)^+(\mathsf{A}/\mathsf{E})]^k$ is $\Sigma^p_k$-complete and MC$[\forall(\forall/\exists)^+(\mathsf{E}/\mathsf{A})]^k$ is
$\Pi^p_k$-complete in the size of the Kripke structure.


We now focus on formulas with multiple trajectory quantifiers. We first show
that alternation-free multiple trajectory quantifiers bump the complexity class
up by one step in the polynomial hierarchy.


Theorem 4. MC$[\exists(\exists/\forall)^+\mathsf{E}\mathsf{E}^+]^k$ is $\Sigma^p_{k+1}$-complete and MC$[\forall(\forall/\exists)^+\mathsf{A}\mathsf{A}^+]^k$ is
$\Pi^p_{k+1}$-complete in the size of the Kripke structure.


Theorem 5. For $k \geq 1$, MC$[\exists(\exists/\forall)^+\mathsf{A}^+\mathsf{E}^+]^k$ is $\Sigma^p_{k+1}$-complete and
MC$[\forall(\forall/\exists)^+\mathsf{E}^+\mathsf{A}^+]^k$ is $\Pi^p_{k+1}$-complete in the size of the Kripke structure.


Finally, Theorems 3, 4, and 5 imply that the model checking problem for
acyclic Kripke structures and A-HLTL formulas with an arbitrary number of
trace quantifier alternations and only one trajectory quantifier is in PSPACE.
5 Case Studies and Evaluation


We now evaluate our technique. The encoding in Section 3 is implemented on
top of the open-source bounded model checker HYPERQB [15]. All experiments
are executed on a MacBook Pro with 2.2GHz processor and 16GB RAM
(https://github.com/TART-MSU/async_hltl_tacas23).


Non-interference in Concurrent Programs. We first consider the programs presented
earlier in Figs. 1 and 3 together with A-HLTL formulas $\varphi_{NI}$ and $\varphi_{NI_{nd}}$ from
Section 1. We receive UNSAT (for the original formula and not its negation), which indicates
that violations have been spotted. Indeed, our implementation successfully finds a
counterexample with a specific trajectory that prints out ‘acdb’ when the high-security
value h is equal to zero (entries of ACDB and ACDB$_{ndet}$ in Table 3). Our other experiment
is an extension of the example in [10] for multiple asynchronous channels (see Fig. 6) and
the following formula:

$\varphi_{OD_{nd}} = \forall\pi.\forall\pi'.\mathsf{A}\tau.\mathsf{E}\tau'.\ \Box(l_{\pi,\tau} \leftrightarrow l_{\pi',\tau}) \rightarrow \Box(obs_{\pi,\tau'} \leftrightarrow obs_{\pi',\tau'})$.

The results for this case are entries of ConcLeak and ConcLeak$_{ndet}$ in Table 3. Details of
the counterexample can be found in [13].

Fig. 6: Program with nondeterministic sequence of inputs.


Speculative Information Flow. Speculative execution is a standard optimiza- 
tion technique that allows branch prediction by the processor. Speculative non- 
interference (SNI) [9] requires that two executions with the same policy p (i.e., 
initial configuration) can be observed differently in speculative semantics (e.g., 
a possible branch), if and only if their non-speculative semantics with normal 
condition checks are also observed differently; i.e., the following A-HLTL formula: 


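$$\varphi_{\mathsf{SNI}} = \forall\pi_1.\forall\pi_2.\forall\pi'_1.\forall\pi'_2.\mathsf{A}\tau.\ \Big(\Box(obs_{\pi_1,\tau} \leftrightarrow obs_{\pi_2,\tau}) \wedge (p_{\pi_1,\tau} \leftrightarrow p_{\pi_2,\tau}) \wedge (p_{\pi'_1,\tau} \leftrightarrow p_{\pi'_2,\tau}) \wedge (p_{\pi_1,\tau} \leftrightarrow p_{\pi'_1,\tau})\Big) \rightarrow \Box(obs_{\pi'_1,\tau} \leftrightarrow obs_{\pi'_2,\tau})$$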


where obs is the memory footprint, traces $\pi_1$ and $\pi_2$ range over the
(nonspeculative) C code and traces $\pi'_1$ and $\pi'_2$ range over the
corresponding (speculative)
assembly code. We evaluate SNI on the translation from a C program (details 
in [13]), where y is the input policy p and multiple versions of x86 assembly 
code [9]. The results of model checking speculative execution are in Table 3 (see 
entries from SpecExcuy;,, to SpecExcuy,,). Additional versions from SpecExcuy; 
to SpecExcuy,, are under different compilation options. Our method correctly 
identifies all the insecure and secure ones as stated in [9].


Compiler Optimization Security. Secure compiler optimization [17] aims at 
preserving input-output behaviors of a source program (original implementation) 
and a target program (after applying optimization), including security policies.
We investigate the following optimization strategies: Dead Branch Elimination
(DBE), Loop Peeling (LP), and Expression Flattening (EF). To verify a secure
optimization, we consider two scenarios: (1) one single I/O event (one trajectory,
similar to [1]), and (2) a sequence of I/O events (two trajectories):


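$$\varphi_{\mathsf{SC}} = \forall\pi.\forall\pi'.\mathsf{E}\tau.\ (in_{\pi,\tau} \leftrightarrow in_{\pi',\tau}) \rightarrow \Box(out_{\pi,\tau} \leftrightarrow out_{\pi',\tau})$$

$$\varphi_{\mathsf{SC}_{nd}} = \forall\pi.\forall\pi'.\mathsf{A}\tau.\mathsf{E}\tau'.\ \Box(in_{\pi,\tau} \leftrightarrow in_{\pi',\tau}) \rightarrow \Box(out_{\pi,\tau'} \leftrightarrow out_{\pi',\tau'}),$$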


where in is the set of inputs and out is the set of outputs. Table 3 (cases
DBE to EFLPndet) shows the verification results of each optimization strategy
and of different combinations of the strategies (details in [13]).


Cache-Based Timing Attacks. Asynchrony also leads to attacks when system 
executions are confined to a single CPU and its cache [18]. A cache-based tim- 
ing attack happens when an attacker is able to guess the values of high-security 
variables when cache operations (i.e., evict, fetch) influence the scheduling of 
different threads. Our case study is inspired by the cache-based timing attack 
example in [18] and we use the formula of observational determinism
$\varphi_{\mathsf{OD}_{nd}}$ introduced earlier in this section to find the
potential attacks (see cases CacheTA and CacheTAndet in Table 3, with details
in [13]).


5.1 Analysis of Experimental Results 


Table 3 presents the diameter of the transition relation, length of trajectories
m, state spaces, and the number of trajectory variables. We also present the
total solving time of our algorithm as well as the breakdown: generating models
(genQBF), building trajectory encodings (buildTr), and final QBF solving
(solveQBF). Our two most complex cases are concurrent leak (ConcLeakndet)
and loop peeling (LPndet). For concurrent leak, this is because there are three
threads with many interleavings (i.e., asynchronous composition), which take
longer to build. For loop peeling, although there is no need to consider
interleavings except for the nondeterministic inputs, the diameters of traces
($D_{K_1}$, $D_{K_2}$) are longer than in other cases, which makes the length
and size of trajectory variables (i.e., $m$ and $|T|$) grow and increases the
total solving time.

Our encoding is able to handle a variety of cases with one or more
trajectories, depending on whether multiple sources of non-determinism are
present. To assess efficiency, we compare the solving time for cases of
compiler optimization with one trajectory with the results in [1]. This
method reduces A-HLTL model checking to HyperLTL model checking for limited
fragments and utilizes the model checker MCHyper. On the other hand, we
directly handle asynchrony by trajectory encoding. Table 2 shows that our
algorithm considerably outperforms the approach in [1] in larger cases.

| Case | MCHyper [1]: Total [s] | This paper: genQBF / buildTr / solveQBF [s] | This paper: Total [s] |
|------|------------------------|---------------------------------------------|-----------------------|
| DBE  | 0.8                    | 0.9 / 0.07 / 0.01                           | 0.98                  |
| LP   | 365.9                  | 1.37 / 1.40 / 1.13                          | 3.90                  |
| EFLP | 1315.2                 | 5.11 / 8.12 / 9.35                          | 22.58                 |

Table 2: Comparison of model checking
compiler optimization with [1].
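To make the comparison in Table 2 concrete, the following minimal sketch computes, for each case, the ratio between the MCHyper time and the total time reported for this paper. The dictionary layout and the function name `speedup` are illustrative assumptions; only the numbers are taken from Table 2.

```python
# Total solving times in seconds, copied from Table 2.
# The data layout and names are illustrative, not part of any tool.
TABLE2 = {
    "DBE": {"mchyper": 0.8, "this_paper": 0.98},
    "LP": {"mchyper": 365.9, "this_paper": 3.90},
    "EFLP": {"mchyper": 1315.2, "this_paper": 22.58},
}


def speedup(case: str) -> float:
    """Ratio of the MCHyper time to the total time reported in this paper."""
    row = TABLE2[case]
    return row["mchyper"] / row["this_paper"]


if __name__ == "__main__":
    for case in TABLE2:
        print(f"{case}: {speedup(case):.1f}x")
```

On the smallest case (DBE) the two approaches are comparable, while on LP and EFLP the ratio is roughly 94 and 58, respectively.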
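Columns $\varphi$ to QBF give the model checking specification and data; columns genQBF to Total give the time taken for solving.

| Models | $\varphi$ | $D_{K_1}$ | $D_{K_2}$ | $m$ | $\lvert S_{K_1}\rvert$ | $\lvert S_{K_2}\rvert$ | $\lvert T\rvert$ | QBF | genQBF [s] | buildTr [s] | solveQBF [s] | Total [s] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ACDB | $\varphi_{\mathsf{NI}}$ | 6 | 6 | 12 | 109 | 109 | 1378 | UNSAT | 2.80 | 0.32 | 0.23 | 3.35 |
| ACDBndet | $\varphi_{\mathsf{NI}_{nd}}$ | 8 | 8 | 16 | 696 | 696 | 2754 | UNSAT | 7.74 | 2.54 | 3.73 | 14.01 |
| ConcLeak | $\varphi_{\mathsf{OD}}$ | 11 | 11 | 22 | 597 | 597 | 6118 | UNSAT | 14.85 | 7.10 | 8.29 | 30.24 |
| ConcLeakndet | $\varphi_{\mathsf{OD}_{nd}}$ | 18 | 18 | 36 | 2988 | 2988 | 22274 | UNSAT | 127.09 | 53.14 | 731.48 | 911.72 |
| SpecExcu_V1 | $\varphi_{\mathsf{SNI}}$ | 3 | 6 | 9 | 132 | 340 | 1112 | UNSAT | 7.45 | 1.72 | 3.07 | 12.24 |
| SpecExcu_V2 | $\varphi_{\mathsf{SNI}}$ | 3 | 6 | 9 | 144 | 168 | 1112 | SAT | 5.61 | 1.28 | 2.44 | 9.33 |
| SpecExcu_V3 | $\varphi_{\mathsf{SNI}}$ | 3 | 6 | 9 | 87 | 340 | 636 | UNSAT | 7.30 | 1.68 | 2.97 | 11.95 |
| SpecExcu_V4 | $\varphi_{\mathsf{SNI}}$ | 3 | 6 | 9 | 93 | 340 | 636 | UNSAT | 7.37 | 1.71 | 4.50 | 13.58 |
| SpecExcu_V5 | $\varphi_{\mathsf{SNI}}$ | 3 | 6 | 9 | 132 | 168 | 636 | SAT | 6.23 | 1.23 | 3.48 | 10.94 |
| SpecExcu_V6 | $\varphi_{\mathsf{SNI}}$ | 3 | 7 | 10 | 132 | 340 | 766 | UNSAT | 7.47 | 1.82 | 3.26 | 12.55 |
| SpecExcu_V7 | $\varphi_{\mathsf{SNI}}$ | 2 | 5 | 7 | 144 | 168 | 352 | SAT | 5.83 | 1.28 | 2.58 | 9.69 |
| DBE | $\varphi_{\mathsf{SC}}$ | 4 | 4 | 8 | 8 | 6 | 546 | SAT | 0.9 | 0.07 | 0.01 | 0.98 |
| DBEndet | $\varphi_{\mathsf{SC}_{nd}}$ | 13 | 13 | 26 | 82 | 72 | 9414 | SAT | 1.60 | 0.56 | 9.61 | 11.77 |
| DBEndet w/ bugs | $\varphi_{\mathsf{SC}_{nd}}$ | 13 | 13 | 26 | 82 | 72 | 9414 | UNSAT | 1.36 | 0.49 | 2.05 | 3.90 |
| LP | $\varphi_{\mathsf{SC}}$ | 22 | 22 | 44 | 80 | 76 | 3870 | SAT | 1.37 | 1.40 | 1.13 | 3.90 |
| LPndet | $\varphi_{\mathsf{SC}_{nd}}$ | 17 | 17 | 34 | 558 | 811 | 19110 | SAT | 7.37 | 3.86 | 48.15 | 59.38 |
| LPndet w/ loops | $\varphi_{\mathsf{SC}_{nd}}$ | 33 | 35 | 68 | 757 | 1591 | 128114 | SAT | 30.52 | 34.99 | 4165.54 | 4231.05 |
| LPndet w/ bugs | $\varphi_{\mathsf{SC}_{nd}}$ | 17 | 17 | 34 | 558 | 661 | 19110 | UNSAT | 6.51 | 3.60 | 20.75 | 30.86 |
| EFLP | $\varphi_{\mathsf{SC}}$ | 32 | 32 | 64 | 80 | 248 | 108290 | SAT | 5.11 | 8.12 | 9.35 | 22.58 |
| EFLPndet | $\varphi_{\mathsf{SC}_{nd}}$ | 18 | 22 | 40 | 582 | 1729 | 28986 | SAT | 15.92 | 8.90 | 135.48 | 160.30 |
| EFLPndet w/ loops | $\varphi_{\mathsf{SC}_{nd}}$ | 33 | 45 | 78 | 295 | 1996 | 178894 | SAT | 36.98 | 62.89 | 121.60 | 221.47 |
| CacheTA | $\varphi_{\mathsf{OD}}$ | 13 | 13 | 26 | 48 | 48 | 9414 | UNSAT | 1.49 | 0.53 | 0.38 | 2.40 |
| CacheTAndet | $\varphi_{\mathsf{OD}_{nd}}$ | 16 | 16 | 32 | 58 | 58 | 16258 | UNSAT | 1.95 | 1.33 | 1.02 | 4.30 |
| CacheTAndet w/ loops | $\varphi_{\mathsf{OD}_{nd}}$ | 35 | 35 | 70 | 88 | 88 | 139302 | UNSAT | 5.50 | 27.65 | 125.92 | 159.07 |

Table 3: Case studies breakdown for Kripke structures: $K_1$, $K_2$ (all case
studies have two, e.g., one for high-level and one for assembly code), formula:
$\varphi$, diameter: $D$, state space: $|S|$, trajectory depth: $m$, and size of
trajectory variables: $|T|$.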


6 Conclusion and Future Work 


In this paper, we focused on the problem of A-HLTL model checking for terminat- 
ing programs. We generalized A-HLTL to allow nested trajectory quantification, 
where a trajectory determines how different traces may advance and stutter. We 
rigorously analyzed the complexity of A-HLTL model checking for acyclic Kripke 
structures. The complexity grows in the polynomial hierarchy with the number
of quantifier alternations, and it is either aligned with that of HyperLTL or
one step higher in the polynomial hierarchy. We also proposed a BMC algorithm
for A-HLTL based on QBF-solving and reported successful experimental results 
on verification of information flow security in concurrent programs, speculative 
execution, compiler optimization, and cache-based timing attacks. 


Asynchronous hyperproperties enable logic-based verification for software 
programs. Thus, future work includes developing abstraction techniques,
such as predicate abstraction and abstraction refinement, in order to obtain
software model checking techniques. We also believe developing synthesis techniques for
A-HLTL creates opportunities to automatically generate secure programs and 
assist in areas such as secure compilation. 
Abstract. We consider linear dynamical systems under floating-point
rounding. In these systems, a matrix is repeatedly applied to a vector, 
but the numbers are rounded into floating-point representation after each 
step (i.e., stored as a fixed-precision mantissa and an exponent). The 
approach more faithfully models realistic implementations of linear loops, 
compared to the exact arbitrary-precision setting often employed in the 
study of linear dynamical systems. 

Our results are twofold: We show that for non-negative matrices there is a 
special structure to the sequence of vectors generated by the system: the 
mantissas are periodic and the exponents grow linearly. We leverage this 
to show decidability of ω-regular temporal model checking against
semialgebraic predicates. This contrasts with the unrounded setting, where
even the non-negative case encompasses the long-standing open Skolem 
and Positivity problems. 

On the other hand, when negative numbers are allowed in the matrix, we 
show that the reachability problem is undecidable by encoding a two-counter 
machine. Again, this is in contrast with the unrounded setting where point- 
to-point reachability is known to be decidable in polynomial time. 


Keywords: Model Checking · Floating-point · Dynamical Systems.


1 Introduction 


Loops are a fundamental staple of any programming language, and the study 
of loops plays a pivotal role in many subfields of computer science, including 
automated verification, abstract interpretation, program analysis, semantics, etc. 
The focus of the present paper is on the algorithmic analysis of simple (i.e., non- 
nested) linear (or affine) while loops, such as the following: 


* A long version of this paper is available as [19]. 
    x = 3, y = 4, z = 2
    while x + 3y + z > 4:
        x = 3x + 2z
        y = 3x + y
        z = y + z


We are interested in analysing how the loop evolves. A simple reachability 
query is to decide whether the loop variables ever satisfy a Boolean combination 
of polynomial inequalities, for example modelling a loop guard. More generally, 
one might seek to consider significantly more complex temporal properties, such 
as those expressible in linear temporal logic or monadic second-order logic: this 
gives rise to a model-checking problem. 

Modelling the evolution of such a loop may require unbounded memory. That
is, the number of bits needed to represent the numbers x, y, and z may grow
larger and larger. However, most computer systems do not represent rational
numbers to arbitrary precision, but rather use floating-point rounding, in which a
number $y$ is stored using two components: the mantissa $m \in \mathbb{Q}$ and
the exponent $a \in \mathbb{Z}$, such that $y = m \cdot 10^{a}$.⁶
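The growth in memory can be observed directly on the example loop. The sketch below iterates it with exact Python integers and records the number of bits of the largest variable after each iteration. It rests on two assumptions that are ours: the three assignments are executed sequentially (each sees the values already updated in the same iteration), and the run is cut off after a fixed number of iterations, since the guard never fails for these positive values.

```python
def run_exact(steps: int):
    """Iterate the example loop with exact integers and record bit sizes.

    Assumption: assignments are executed in order within an iteration,
    as Python would execute them; `steps` bounds the number of iterations.
    """
    x, y, z = 3, 4, 2
    sizes = []
    for _ in range(steps):
        if not (x + 3 * y + z > 4):  # loop guard from the example
            break
        x = 3 * x + 2 * z
        y = 3 * x + y
        z = y + z
        # Number of bits needed for the largest of the three variables.
        sizes.append(max(v.bit_length() for v in (x, y, z)))
    return sizes


if __name__ == "__main__":
    print(run_exact(10))
```

The recorded sizes increase at every iteration, which is the behaviour that floating-point rounding is meant to avoid.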

Typically floating-point numbers are specified using either 32 or 64 bits, with 
some of these reserved for the mantissa and some for the exponent, thus bounding 
both the mantissa and the exponent. We do not do this, and only place a 
bound on the number of bits representing the mantissa, allowing the exponent to 
grow unboundedly (in either direction). From a theoretical standpoint, bounding 
the number of bits of both the mantissa and the exponent would necessarily give 
rise to a finite-state system, for which essentially any decision problem would 
become decidable (at least in principle, if not necessarily in practice). Due to 
the unboundedness of exponents in our setting, we do not have to consider 
overflows (‘NaN’, ‘infinity’ or ‘-infinity’ which are part of most floating-point 
specifications). 

Formally, we model our programs using linear dynamical systems (LDS), 
which comprise a starting vector representing the initial state of each variable 
and a matrix describing the evolution of the program. An LDS generates an 
infinite sequence of vectors (the orbit of the system) by multiplying the matrix 
with the current vector and then applying floating-point rounding to the result. 


Our results 


We consider the model-checking problem for linear dynamical systems evolving
under floating-point rounding. More formally, let $Y_1,\dots,Y_p \subseteq
\mathbb{R}^d$ be semialgebraic targets. Given an orbit
$(x^{(t)})_{t\in\mathbb{N}}$, we define the characteristic word
$w = w_1 w_2 w_3 \dots$ with respect to $Y_1,\dots,Y_p$ over the alphabet
$2^{\{1,\dots,p\}}$ such that $i \in w_t$ if and only if $x^{(t)} \in Y_i$.
The model-checking problem asks whether $w$ is in an ω-regular language, or
equivalently satisfies a temporal specification given in monadic second-order
logic (MSO).


6 We work in base 10 throughout for simplicity of exposition. All our results carry over 
mutatis mutandis in any integer base, including base 2 as typically used in practice. 
Our results show that analysing LDS under floating-point rounding is neither
clearly easier nor harder than in the standard setting (without rounding). Our 
first contribution establishes undecidability of point-to-point reachability (and 
a fortiori model checking) under floating-point rounding, a surprising outcome 
given that point-to-point reachability is solvable in polynomial time without 
rounding [16]. On the other hand, in the standard setting neither decidability 
nor undecidability are known for full model checking (although mathematical 
hardness results exist); see [24,18,17]. 


Theorem 1. The floating-point point-to-point reachability problem is undecid- 
able. 


However, for non-negative matrices, we show that the full MSO model- 
checking problem is decidable in our setting, without restrictions on the di- 
mensions of the predicates or the ambient space. This is in stark contrast to 
the standard setting, where assuming non-negativity does not simplify the prob- 
lem. Model checking non-negative LDS without rounding would require (at a 
minimum) solving the longstanding open Skolem and Positivity problems [2]. 


Theorem 2. Let $(M, x)$ be a non-negative linear dynamical system, let
$Y_1,\dots,Y_p$ be semialgebraic targets and let $\varphi$ be an MSO formula
using predicates over $Y_1,\dots,Y_p$. It is decidable whether the
characteristic word under floating-point rounding satisfies $\varphi$.


We place no dimension restriction on the predicates; in particular, this shows
that the Skolem and Positivity problems are decidable on non-negative systems
under floating-point rounding. At this time, however, we do not have complexity
upper bounds on our model-checking algorithm, or lower bounds on the
model-checking problem.


Related work 


There is a line of practical tools for the analysis, verification, and invariant 
synthesis for floating-point loops [7,20,1,22]. These tools typically work well in 
practice, but do not necessarily work in all cases. The analysis of concrete im- 
plementations of floating-point specifications requires careful analysis of edge 
cases around ±∞ and ‘NaN’. In contrast to these tools, which focus primarily on
practical analysis, our work seeks to understand the theoretical possibilities and 
limitations of the exact analysis of (possibly long-running) floating-point loops 
in a generalised setting. 

The study of linear dynamical systems explores the sequence of vectors in- 
duced by a matrix. Model checking is only known to be decidable for certain 
classes of semialgebraic predicates—in particular those with low dimension [18] 
or for prefix-independent properties [4]; see also [17]. The well-known Skolem 
and Positivity problems being special cases of model checking, they place tech- 
nical limits on the dimensions that can be handled without first resolving long- 
standing open cases of these problems. Recent progress suggests that the Skolem 
problem may yet be conquered, at least for diagonalisable matrices [8,21],
but Positivity requires solving particularly difficult problems in analytic number 
theory [24,12]. The non-negative case can be used to model sequences of distri- 
butions induced by Markov chains [6], although all hardness limitations apply 
already in the probabilistic setting [2]. 

Baier et al. [5] consider LDS under rounding to fixed-decimal precision, show- 
ing reachability is PSPACE-complete for hyperbolic systems (when no eigenvalue 
has modulus one) and decidable for certain other constrained classes of rounding. 
A notable difference of fixed-decimal precision is that it cannot allow arbitrarily 
small numbers, unlike the floating-point numbers we consider. 

A recent line of work focusses on linear dynamical systems with perturba- 
tions at every step, with a view to understanding the robustness of reachability 
problems [13,14,3]. However, unlike rounding, the perturbation is chosen in order 
to assist hitting the target and the perturbation is arbitrarily small. 

For linear while loops the reachability problem can be rephrased as a halt- 
ing problem, asking whether a guard condition is eventually met from a given 
initial state. The related termination problem asks whether a guard condition is 
met from every initial state [26,10]. Issues arising from implementations using 
floating-point representations to solve the termination problem of unrounded 
(arbitrary precision) loops are considered in [27]. In contrast, we are interested 
in analysing programs in which the intended behaviour is to round the numbers 
to fixed-precision floating-point numbers at every step of the loop. 


Organisation In Section 2, we formalise the model and problems and discuss 
some of the properties of floating-point rounding. In Section 3, we present our 
undecidability result for the general case. Finally, in Section 4 we establish some 
special periodic structure associated with the orbit and use this structure in 
Section 5 to show that model checking is decidable for non-negative LDS. 


2 Preliminaries 


2.1 Linear dynamical systems and rounding functions 


Definition 1. A $d$-dimensional linear dynamical system (LDS) $(M,x)$
comprises a matrix $M \in \mathbb{Q}^{d\times d}$ and an initial vector
$x \in \mathbb{Q}^{d}$.

Given a rounding function $[\cdot] : \mathbb{Q}^{d} \to \mathbb{Q}^{d}$ and an
LDS $(M,x)$, the rounded orbit $\mathcal{O}$ is the sequence
$(x^{(t)})_{t\in\mathbb{N}}$ such that $x^{(0)} = [x]$ and
$x^{(t)} = [M x^{(t-1)}]$ for all $t \ge 1$.


Given $p \in \mathbb{N}$, we say that a number $x$ is a floating-point number
with precision $p$ if $x = m \cdot 10^{a}$ such that $m \in \mathbb{Q}$ is a
decimal number in $\{0\} \cup [0.1, 1)$ with $p$ digits in the fractional part
(after the decimal point) and $a \in \mathbb{Z}$. In particular, we associate
by convention the number with mantissa $m = 0$ to the exponent $-\infty$.
Given a number $x = m \cdot 10^{a}$ we define $\mathrm{mantissa}(x) = m$ and
$\mathrm{exponent}(x) = a$.

We are interested in the floating-point rounding function $[\cdot]$ with
precision $p \in \mathbb{N}$. Given a real number $x \in \mathbb{R}$, we define
$[x]$, the floating-point rounding of $x$, as the closest floating-point number
with precision $p$ based on the first $p+1$ digits of $x$.

Where there are two possible choices, any deterministic choice that is
consistent with the properties listed below is acceptable.⁷ We denote by
$\mathrm{FP}_{10}[p]$ the subset of $\mathbb{Q}$ representable in base 10 as
floating-point numbers with $p$ digits. We use the following useful properties
of the rounding function:


– it is log-bounded, i.e. there exists a constant $c \in \mathbb{R}_{+}$ such
  that $\forall x \in \mathbb{R}$, $\frac{|x|}{c} \le |[x]| \le c\,|x|$.

– it is mantissa-based, i.e. if $x = 10^{a} x'$, then $[x] = 10^{a}[x']$.

– it is $(p+1)$-finite, i.e. the output of the rounding is not dependent on the
  $i$-th digit of the mantissa, for each integer $i > p+1$. In other words, if
  $x$ and $x'$ agree on the first $p+1$ digits then $[x] = [x']$.

– it is sign preserving, i.e. $\mathrm{sign}(x) = \mathrm{sign}([x])$. The fact
  that $[x] = 0$ if and only if $x = 0$ also follows from the log-bounded
  property.


The floating-point rounding is defined above on a single real. It is extended
straightforwardly to a vector $x$ by applying it to each of its components
$(x)_i$, where $i$ ranges from 1 to the dimension of the vector. As such, the
term $[Mx]$ is obtained by first computing exactly the vector $Mx$ and then by
rounding each component $(Mx)_i$. An alternative approach could be to maintain
each sub-computation in $p$-bits of precision, but this is not the approach we
take. Such an orbit can be simulated in our setting by increasing the dimension
so that operations can be staggered in a way that at most one operation (scalar
product or variable addition) is used in each assignment.
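The definitions of this subsection can be illustrated by a small sketch, written under assumptions of ours: exact rationals are represented with `fractions.Fraction`, the rounding decision looks only at the first $p+1$ mantissa digits, and the $(p+1)$-th digit is rounded half up on the magnitude (one admissible deterministic choice among several). The names `round_fp` and `rounded_orbit` are hypothetical.

```python
from fractions import Fraction
from math import floor


def round_fp(x: Fraction, p: int) -> Fraction:
    """Round x to a base-10 floating-point number with p mantissa digits.

    Assumption: only the first p + 1 digits of the mantissa are inspected,
    and the (p + 1)-th digit is rounded half up on the magnitude.
    """
    if x == 0:
        return Fraction(0)
    sign = 1 if x > 0 else -1
    a = abs(x)
    # Find the exponent e with 0.1 <= a / 10**e < 1.
    e = 0
    while a >= Fraction(10) ** e:
        e += 1
    while a < Fraction(10) ** (e - 1):
        e -= 1
    mantissa = a / Fraction(10) ** e
    digits = floor(mantissa * 10 ** (p + 1))  # first p + 1 digits
    kept, last = divmod(digits, 10)
    if last >= 5:
        kept += 1
    return sign * kept * Fraction(10) ** (e - p)


def rounded_orbit(M, x, p, steps):
    """Return x^(0), ..., x^(steps): exact product M x, then componentwise rounding."""
    current = [round_fp(Fraction(v), p) for v in x]
    orbit = [current]
    for _ in range(steps):
        exact = [sum(Fraction(M[i][j]) * current[j] for j in range(len(x)))
                 for i in range(len(M))]
        current = [round_fp(v, p) for v in exact]
        orbit.append(current)
    return orbit


if __name__ == "__main__":
    # One-dimensional doubling system with a single mantissa digit.
    print([int(v[0]) for v in rounded_orbit([[2]], [1], 1, 7)])
```

For the doubling system with $p = 1$, the value 16 is rounded to 20 and 160 to 200, so the printed orbit is 1, 2, 4, 8, 20, 40, 80, 200.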


2.2 Model checking 
We consider the model-checking problem of an LDS over semialgebraic sets. 


Definition 2. A semialgebraic set $Y \subseteq \mathbb{R}^{d}$ is defined by a
finite Boolean combination of polynomial inequalities.


Let $(M,x)$ be an LDS with rounded orbit $\mathcal{O}$ and
$\mathcal{Y} = \{Y_1,\dots,Y_p\}$ be a collection of semialgebraic sets. The
characteristic word of $\mathcal{O}$ is $w = w_1 w_2 w_3 \dots \in
(2^{\{1,\dots,p\}})^{\omega}$ such that $j \in w_t$ if and only if
$x^{(t)} \in Y_j$.

The model-checking problem asks whether the characteristic word is contained
within a given ω-regular language, usually specified in a temporal logic
such as monadic second-order logic (MSO), or often its LTL fragment. Without
loss of generality we assume that the property is given as a Büchi automaton [11].


Problem 1 (Floating-point Model-checking Problem). Given an LDS $(M, x)$ with
rounded orbit $\mathcal{O}$, a collection of semialgebraic sets
$\mathcal{Y} = \{Y_1,\dots,Y_p\}$ and an ω-regular specification $\varphi$, the
model-checking problem consists in deciding whether the characteristic word
$w$ of $\mathcal{O}$ satisfies the specification $\varphi$.
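As an illustration of the characteristic word, the sketch below computes a finite prefix of it from a finite prefix of an orbit. The two targets `Y1` and `Y2` are hypothetical examples of ours, each a Boolean combination of polynomial inequalities given as a Python membership test, and letters are indexed by position in the supplied list of vectors.

```python
from fractions import Fraction


def characteristic_prefix(orbit, targets):
    """Return one letter per vector of `orbit`.

    `targets` is a list of membership tests, one per set Y_1, ..., Y_p;
    a letter contains the 1-based index j iff the vector lies in Y_j.
    """
    return [
        frozenset(j for j, inside in enumerate(targets, start=1) if inside(v))
        for v in orbit
    ]


# Hypothetical targets in R^2 (assumptions made for this example only).
def Y1(v):
    return v[0] ** 2 + v[1] ** 2 <= 25  # closed disc of radius 5


def Y2(v):
    return v[0] - v[1] > 0 or v[0] * v[1] == 0  # open half-plane or an axis


if __name__ == "__main__":
    orbit = [
        [Fraction(1), Fraction(2)],
        [Fraction(3), Fraction(4)],
        [Fraction(6), Fraction(0)],
    ]
    print(characteristic_prefix(orbit, [Y1, Y2]))
```

Deciding the model-checking problem requires reasoning about the whole infinite word; a prefix computed this way can only serve as a sanity check.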


7 For example, always rounding up, always rounding down, round to even, rounding
towards zero, rounding away from zero are acceptable, providing the choice is fixed.
We will also consider the point-to-point reachability problem, which is a
subcase of the model-checking problem (Problem 1): 


Problem 2 (Floating-point Point-to-point Reachability Problem). Given a
$d$-dimensional LDS $(M, x)$ and a target vector $y \in \mathbb{Q}^{d}$, the
point-to-point reachability problem consists in deciding whether $y$ belongs to
the rounded orbit $\mathcal{O}$.


Given a target $Y \subseteq \mathbb{R}^{d}$, we associate the set of hitting
times $\mathcal{Z}(Y) = \{t \mid x^{(t)} \in Y\}$. Under this formulation, the
reachability problem is reformulated as whether $\mathcal{Z}(Y)$ is empty.
However, for model checking we will develop a more comprehensive understanding
of the hitting times of each target $Y_1,\dots,Y_p$.


2.3 Structure of M 


Formally, $M$ is a $d$-dimensional matrix indexed by the elements
$\{1,\dots,d\}$. However, we interpret $M$ as an automaton over states
$Q = \{q_1,\dots,q_d\}$ and reference the entries of $M$ by pairs of states.
That is, we refer to $M_{q_1,q_2}$ rather than $M_{1,2}$.

We denote by $G_M$ the weighted directed graph whose adjacency matrix is
$M$. That is, a graph with vertices $Q$ and with an edge from $q_i$ to $q_j$
weighted by $M_{q_j,q_i}$ if $M_{q_j,q_i} \neq 0$.

Let $S_1,\dots,S_s \subseteq Q$ be the strongly connected components (SCCs) of
$G_M$. Our analysis will consider each strongly connected component separately,
thus it will often be useful to consider the entries of
$x \in \mathrm{FP}_{10}[p]^{d}$ corresponding only to one strongly connected
component. Without loss of generality, by reordering the states where
necessary, we assume that the states in $Q$ are ordered so that states within
the same SCC appear next to one another, and the strongly connected components
are topologically sorted, i.e. there is no edge from $S_i$ to $S_j$ where
$i > j$. We split a vector $x$ into $s$ smaller vectors, denoted
$x_{S_1},\dots,x_{S_s}$, each representing the entries of $x$ corresponding to
the SCC. Letting $x_{S_j} = (x_{1,j},\dots,x_{d_j,j})^{\top}$ and
$|S_j| = d_j$, we thus have that $x$ is partitioned as

$$x = (x_{1,1} \cdots x_{d_1,1} \ \cdots \ x_{1,s} \cdots x_{d_s,s})^{\top}.$$

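Moreover, for each pair of SCCs $S_i, S_j$, we denote by $M_{S_i,S_j}$ the
submatrix of $M$ restricted to the rows related to $S_i$ and columns related to
$S_j$, which is a matrix with $d_i$ rows and $d_j$ columns. If $S_i = S_j$, we
simply write $M_{S_i}$. In other words, $M_{S_i,S_j}$ is the matrix that shows
the dependency between $S_i$ and $S_j$, and we have

$$M = \begin{pmatrix} M_{S_1} & 0 & \cdots & 0 \\ M_{S_2,S_1} & M_{S_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ M_{S_s,S_1} & M_{S_s,S_2} & \cdots & M_{S_s} \end{pmatrix}.$$

We say $S_i$ feeds $S_j$, and $S_j$ is fed by $S_i$, if there is some edge in
$G_M$ from some state in $S_i$ to some state in $S_j$.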


8 Note that the orientation of the edge may appear switched from the reader’s ex- 
pectation. This is due to the convention that M is pre-multiplied with x at every 
step. 
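The graph view of $M$ can be sketched as follows, assuming the convention that value flows along edges: there is an edge from state $i$ to state $j$ exactly when the entry in row $j$ and column $i$ is non-zero. States are 0-based indices here, and the SCC computation uses simple forward reachability for brevity (a linear-time algorithm such as Tarjan's could be substituted). All function names are ours.

```python
def graph_of(M):
    """Adjacency lists of G_M: an edge i -> j exists iff M[j][i] != 0."""
    d = len(M)
    return {i: [j for j in range(d) if M[j][i] != 0] for i in range(d)}


def sccs(M):
    """Strongly connected components of G_M via mutual reachability."""
    graph = graph_of(M)
    d = len(M)

    def reachable(start):
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for w in graph[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    reach = [reachable(i) for i in range(d)]
    components, assigned = [], set()
    for i in range(d):
        if i in assigned:
            continue
        # j is in the component of i iff each reaches the other.
        comp = frozenset(j for j in reach[i] if i in reach[j])
        components.append(comp)
        assigned |= comp
    return components


def feeds(M, S_i, S_j):
    """True iff some edge of G_M goes from a state of S_i to a state of S_j."""
    return any(M[r][c] != 0 for c in S_i for r in S_j)


if __name__ == "__main__":
    M = [[1, 0, 0],
         [2, 0, 3],
         [0, 4, 0]]
    print(sccs(M))
```

In this example state 0 forms a top SCC that feeds the component made of states 1 and 2, and the matrix is block lower triangular for this ordering.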
3 Undecidability of point-to-point reachability


In this section, we give a sketch of the proof of the undecidability of Problem 2 
(and thus of Problem 1) in the general case. The full proof can be found in the 
long version of this paper [19]. 


Theorem 1. The floating-point point-to-point reachability problem is undecid- 
able. 


This result is obtained by reduction from the termination of a two-counter 
Minsky machine. We recall the definition of this model: 


Definition 3. A two-counter Minsky machine is defined by a finite set of states
$\ell_1,\dots,\ell_m$, a distinguished starting state (w.l.o.g. $\ell_1$), a
distinguished halting state (w.l.o.g. $\ell_m$), two natural integer counters,
here denoted as x and y, and a mapping deterministically associating to each
state transition a particular action.
Each transition takes one of the following forms: for $z \in \{x,y\}$,


increment $\mathrm{inc}_z(\ell_j)$: add 1 to counter $z$, move to state $\ell_j$.
decrement $\mathrm{dec}_z(\ell_j)$: remove 1 from counter $z$ if $z > 0$, move to state $\ell_j$.
zero test $\mathrm{zero?}_z(\ell_j, \ell_k)$: if $z = 0$ move to state $\ell_j$ else move to state $\ell_k$.


The configuration of a two-counter Minsky machine consists of the current 
state and the values of x and y. 


Without loss of generality (by first using a zero test), one can assume a 
decrementation operation is never used in a configuration where the counter to 
be decreased has value 0, hence removing the need to check whether z > 0. 

The halting problem asks whether, starting in configuration $(\ell_1, 0, 0)$,
that is, in the distinguished starting state with both counters set to 0, the
state $\ell_m$ is reached. The problem is undecidable [23].
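The machine model can be made concrete with a small bounded interpreter. This is a sketch under our own encoding: states are arbitrary hashable labels, instructions are tuples, and the step budget `max_steps` is an assumption needed because halting is undecidable, so a bounded run can confirm halting but never refute it.

```python
def run_minsky(program, start, halt, max_steps):
    """Simulate a two-counter Minsky machine for at most max_steps steps.

    `program` maps each non-halting state to one instruction:
      ("inc", z, j)      add 1 to counter z, go to state j
      ("dec", z, j)      subtract 1 from counter z if z > 0, go to state j
      ("zero", z, j, k)  go to j if counter z is 0, otherwise go to k
    Returns the final configuration (state, x, y), or None if the halting
    state was not reached within the step budget.
    """
    state, counters = start, {"x": 0, "y": 0}
    for _ in range(max_steps):
        if state == halt:
            return state, counters["x"], counters["y"]
        instr = program[state]
        if instr[0] == "inc":
            counters[instr[1]] += 1
            state = instr[2]
        elif instr[0] == "dec":
            if counters[instr[1]] > 0:
                counters[instr[1]] -= 1
            state = instr[2]
        elif instr[0] == "zero":
            state = instr[2] if counters[instr[1]] == 0 else instr[3]
        else:
            raise ValueError(f"unknown instruction {instr!r}")
    return (state, counters["x"], counters["y"]) if state == halt else None


# Hypothetical machine: set x to 2, then move the content of x into y.
PROGRAM = {
    1: ("inc", "x", 2),
    2: ("inc", "x", 3),
    3: ("zero", "x", 6, 4),
    4: ("dec", "x", 5),
    5: ("inc", "y", 3),
}

if __name__ == "__main__":
    print(run_minsky(PROGRAM, start=1, halt=6, max_steps=100))
```

The example machine halts in state 6 with counters x = 0 and y = 2 after nine steps.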

We build an LDS with mantissa length p = 1 and base 10 that simulates a 
run of a given Minsky machine. The reduction happens to maintain the invariant 
that each mantissa always has the value 0 or 1 after rounding (although, as we 
operate in base 10, there are 10 possible values the mantissa could have taken). 
For ease of readability, we describe this LDS using variables to represent the 
dimensions and linear functions to represent the transition matrix. For each 
state of the Minsky machine, we use two variables corresponding to the two 
counters. Throughout the simulation, if the Minsky machine is in state j, the 
counter values are stored in the exponents of the variables associated with state 
j, and all other variables are zero. 

The crux of our reduction lies in the handling of the zero test. More precisely,
suppose we need to branch depending on whether x is equal to 0, then we need
to define linear transitions that transfer the values of the two counters from
one pair of variables to the appropriate new pair of variables. This is done using
filter functions: the function $\mathrm{filter}_{\ge}(u, v)$ (resp.
$\mathrm{filter}_{<}(u, v)$) is equal to $v$ if $v \ge u$ (resp. $v < u$) and
to 0 otherwise. We end this sketch with the construction of these functions and
a proof that they operate as advertised.
Lemma 1. Given $u, v$ of the form $10^{c}$ with $c \in \mathbb{N}$, one can
compute the value $w = \mathrm{filter}_{\ge}(u,v)$ in three linear operations
with floating-point rounding.


Proof. We compute $w = \mathrm{filter}_{\ge}(u,v)$ in three successive
operations using two temporary variables, temp and temp2, initially set at 0
(recall, rounding is applied after each step):

    temp  ← u + v
    temp2 ← temp − u
    w     ← 1.1 × temp2

Let $c_1, c_2 \in \mathbb{N}$ such that $u = 10^{c_1}$ and $v = 10^{c_2}$.
Recall that the notation $[\cdot]$ is the floating-point rounding function.

First observe that if $c_1 = c_2$:

    temp  ← $[10^{c_1} + 10^{c_2}] = 2 \cdot 10^{c_2}$
    temp2 ← $[2 \cdot 10^{c_2} - 10^{c_1}] = 10^{c_2}$ ($= v$)
    w     ← $[1.1 \cdot 10^{c_2}] = 10^{c_2} = v$, as required.

Secondly, assume that $u > v$, and thus $c_1 > c_2$:

    temp  ← $[10^{c_1} + 10^{c_2}] = 10^{c_1} = u$
    temp2 ← $[10^{c_1} - 10^{c_1}] = 0$
    w     ← $[1.1 \cdot 0] = 0$, as required.

We split the case that $v > u$, thus $c_2 > c_1$, into two cases. Suppose
$c_2 > c_1 + 1$:

    temp  ← $[10^{c_1} + 10^{c_2}] = 10^{c_2} = v$
    temp2 ← $[10^{c_2} - 10^{c_1}] = [0.\underbrace{99\ldots99}_{c_2 - c_1 \ge 2} \cdot 10^{c_2}] = 1 \cdot 10^{c_2} = v$
    w     ← $[1.1 \cdot 10^{c_2}] = 10^{c_2} = v$, as required.

Finally, $c_2 = c_1 + 1$:

    temp  ← $[10^{c_1} + 10^{c_2}] = 10^{c_2} = v$
    temp2 ← $[10^{c_2} - 10^{c_1}] = [0.9 \cdot 10^{c_2}] = 9 \cdot 10^{c_2 - 1}$
    w     ← $[1.1 \cdot 9 \cdot 10^{c_2 - 1}] = [9.9 \cdot 10^{c_2 - 1}] = 10 \cdot 10^{c_2 - 1} = 10^{c_2} = v$, as required.


Corollary 1. Given $u, v$ of the form $10^{c}$ with $c \in \mathbb{N}$, one can
compute the value $w = \mathrm{filter}_{<}(u,v)$ in four linear operations with
floating-point rounding.


Proof. Observe that $\mathrm{filter}_{<}(u,v) = v - \mathrm{filter}_{\ge}(u,v)$,
which can be encoded in four steps by first computing
$\mathrm{filter}_{\ge}(u,v)$ in three steps.
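The case analysis of the two filter functions can be replayed mechanically. The sketch below is ours, not the construction of the long version: it fixes mantissa length $p = 1$ in base 10, uses exact rationals, rounds the second mantissa digit half up (an assumed tie-breaking choice that does not matter on these inputs), and names the two functions `filter_ge` and `filter_lt`.

```python
from fractions import Fraction
from math import floor


def round_fp(x: Fraction, p: int = 1) -> Fraction:
    """Base-10 floating-point rounding with p mantissa digits (half up)."""
    if x == 0:
        return Fraction(0)
    sign = 1 if x > 0 else -1
    a = abs(x)
    e = 0
    while a >= Fraction(10) ** e:  # find e with 0.1 <= a / 10**e < 1
        e += 1
    while a < Fraction(10) ** (e - 1):
        e -= 1
    digits = floor(a / Fraction(10) ** e * 10 ** (p + 1))  # first p + 1 digits
    kept, last = divmod(digits, 10)
    if last >= 5:
        kept += 1
    return sign * kept * Fraction(10) ** (e - p)


def filter_ge(u: Fraction, v: Fraction) -> Fraction:
    """Three rounded linear operations: returns v if v >= u, else 0."""
    temp = round_fp(u + v)
    temp2 = round_fp(temp - u)
    return round_fp(Fraction(11, 10) * temp2)


def filter_lt(u: Fraction, v: Fraction) -> Fraction:
    """Four rounded linear operations: returns v if v < u, else 0."""
    return round_fp(v - filter_ge(u, v))


if __name__ == "__main__":
    for c1 in range(4):
        for c2 in range(4):
            u, v = Fraction(10) ** c1, Fraction(10) ** c2
            print(c1, c2, filter_ge(u, v), filter_lt(u, v))
```

The inputs are restricted to powers of ten, as in the statements above; for other inputs these three operations are not claimed to behave as a filter.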


4 Pseudo-periodic orbits of non-negative LDS 


We shift our focus to proving that model checking is decidable for systems with 
non-negative matrices. We first establish the behaviour of the system in this 
section and then complete the proof of Theorem 2 in Section 5. Our main result 
is that the rounded orbit of an LDS is periodic in the following sense, which we 
call pseudo-periodic. 


Definition 4. A sequence $(x^{(t)})_{t\in\mathbb{N}}$ of $d$-dimensional
vectors of floating-point numbers is called pseudo-periodic if and only if
there exists a starting point $N \in \mathbb{N}$, period $T \in \mathbb{N}$ and
growth rates $\alpha_1,\dots,\alpha_d \in \mathbb{Z}$ such that

$$\forall t \ge N,\ \forall j \in \{1,\dots,d\}:\quad (x^{(t+T)})_j = 10^{\alpha_j}\,(x^{(t)})_j.$$

We say the sequence is effectively pseudo-periodic if the defining constants
$N, T, \alpha_1,\dots,\alpha_d$ can be computed.


Theorem 3. Let $(M,x)$ be a $d$-dimensional LDS where $M$ is non-negative and
let $(x^{(t)})_{t\in\mathbb{N}}$ be its rounded orbit.
The rounded orbit $(x^{(t)})_{t\in\mathbb{N}}$ is effectively pseudo-periodic.
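Pseudo-periodicity can be checked on a finite prefix of an orbit once candidate constants are given. The sketch below does exactly that and nothing more: it is a sanity check under our own naming, not the procedure that computes $N$, $T$ and the growth rates. The example prefix is the rounded orbit of the one-dimensional doubling system with one mantissa digit, worked out by hand under rounding to the nearest representable number.

```python
from fractions import Fraction


def is_pseudo_periodic_on_prefix(orbit, N, T, alphas):
    """Check x^(t+T)_j == 10**alphas[j] * x^(t)_j for all N <= t < len(orbit) - T."""
    for t in range(N, len(orbit) - T):
        for j, alpha in enumerate(alphas):
            if orbit[t + T][j] != Fraction(10) ** alpha * orbit[t][j]:
                return False
    return True


# Rounded orbit of M = (2), x = (1) with p = 1: 16 rounds to 20, 160 to 200, ...
DOUBLING = [[Fraction(v)] for v in (1, 2, 4, 8, 20, 40, 80, 200, 400, 800)]

if __name__ == "__main__":
    print(is_pseudo_periodic_on_prefix(DOUBLING, N=1, T=3, alphas=[1]))
```

Here the mantissas cycle through 2, 4, 8 from time 1 onwards and the exponent grows by one every three steps, so the check succeeds with $N = 1$, $T = 3$ and growth rate 1, while it fails with $N = 0$.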


In order to establish this result, we will find some partitions of the graph
associated to $M$ such that each part is effectively pseudo-periodic with the
same growth rate $\alpha$ for every state in the partition.


4.1 Preprocessing periodicity 


The core of our approach is to show that, within each SCC of the graph associated 
to M, the values associated with states are of similar magnitude. This is however 
only true if the SCC is aperiodic. When a state is in a periodic SCC its value 
could change drastically depending on which phase the system is in. For example, 
consider a simple alternation between two states, in which the value is very large 
in one state and very small in the other; the states will alternate between big 
and small values. 

We “hide” these periodic behaviours by blowing up the system so that each
SCC of the new system describes only one of the periodic subsequences, and we
will subsequently show that the value of each state in an SCC is either zero or
of a similar magnitude.

We apply the following construction to our system. Let $P$ be the period,
defined as the least common multiple of the lengths of all simple cycles in the
graph. Let $Q$ be the indices of $M$ (i.e. the states of the generated
automaton). We define new states $Q' = Q \times \{0,\dots,P-1\}$ by annotating
each state in $Q$ with the phase. To avoid cluttering notation we will
regularly refer to states in $Q'$ in the form $(q, i+\ell)$ for
$\ell \in \mathbb{Z}$, on the understanding that the phase, $i+\ell$, is
normalised into $\{0,\dots,P-1\}$ by taking the residue modulo $P$ if
necessary. We define a new matrix $M'$ over the states $Q'$ such that
$M'_{(q,i+1),(q',i)} = M_{q,q'}$ for $i \in \{0,\dots,P-1\}$, and zero
otherwise. We initialise a new starting vector $x'$ with $x'_{(q,0)} = x_q$ and
$x'_{(q,i)} = 0$ for $i \in \{1,\dots,P-1\}$.

Intuitively, at each time step t the vector generated by the original system is 
equal to the vector of the new system restricted to the states indexed by i = t 
mod P and every state with another index is equal to 0. 
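The blow-up can be sketched in a few lines. The code below takes the period $P$ as a parameter rather than computing it from the simple cycles, stores state $(q, i)$ at position $i \cdot d + q$, and works with exact integers without rounding; all three are simplifying assumptions of ours. Since rounding is applied componentwise and leaves zero unchanged, the same phase structure is expected under rounding.

```python
def blow_up(M, x, P):
    """Build (M', x') over states Q x {0, ..., P-1}; (q, i) is stored at i*d + q."""
    d = len(M)

    def idx(q, i):
        return (i % P) * d + q  # phase is normalised modulo P

    M2 = [[0] * (d * P) for _ in range(d * P)]
    for i in range(P):
        for q in range(d):
            for q2 in range(d):
                # M'_{(q, i+1), (q', i)} = M_{q, q'}
                M2[idx(q, i + 1)][idx(q2, i)] = M[q][q2]
    x2 = [0] * (d * P)
    for q in range(d):
        x2[idx(q, 0)] = x[q]  # only phase 0 is initially populated
    return M2, x2


def step(M, v):
    """One exact step v -> M v (no rounding in this sketch)."""
    return [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]


def phase_slice(v, d, i):
    """Entries of the blown-up vector v belonging to phase i."""
    return v[i * d:(i + 1) * d]


if __name__ == "__main__":
    M, x, P = [[0, 2], [3, 0]], [1, 1], 2
    M2, v2 = blow_up(M, x, P)
    v = list(x)
    for t in range(4):
        print(t, v, [phase_slice(v2, len(M), i) for i in range(P)])
        v, v2 = step(M, v), step(M2, v2)
```

At each time $t$ the printed slice for phase $t \bmod P$ coincides with the original vector and the other slice is zero.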

Let $S \subseteq Q$ be a strongly connected component. In $Q'$ there exist
strongly connected components $S'_1,\dots,S'_k \subseteq Q'$ with $k \le |S|$
such that $\bigcup_{i} S'_i = S \times \{0,\dots,P-1\}$. Each set $S'_i$ is
periodic, with period $P$.

Henceforth in the rest of this section we work on the system (M’, x’) implicitly 
over states Q’ which, by overloading of notation, we rename (M,x) over Q to 
avoid cluttering notation. 
Note that this transformation also requires us to marginally complicate the 
targets. Indeed, consider a set $Y \subseteq \mathbb{R}^{Q}$. We define the sets $Y/i$ for $i < P$ such that 
$Y/i = \{y' \in \mathbb{R}^{Q'} \mid \exists y \in Y : y'_{(q,i)} = y_q \text{ for } q \in Q \text{ and } y'_{(q,j)} = 0 \text{ for } j \neq i\}$. 
The hitting times of $Y$, $Z(Y)$, in the original LDS can then be obtained in the 
new LDS as the disjoint union $\bigcup_{i \in \{0,\dots,P-1\}} Z(Y/i)$. It suffices to characterise 
the hitting times for each $Y/i$. 




4.2 Pseudo-periodicity within top SCCs 


Let us first consider top SCCs, that is, SCCs with no incoming edges from states 
of other SCCs. In a top SCC the value of each variable at each step depends only 
on the values of states in the same SCC. 


Lemma 2. Let $S_j$ be a strongly connected component of $(M, x)$. Let $S_{j,i} = 
\{(q,i) \in S_j\}$ be the states associated with $S_j$ from the $i$-th phase. 
There exists $C \le Pd^2$ such that, for every $i, j$, $(M^C)_{S_{j,i}}$ is positive. 


Proof. The matrix $(M^P)_{S_{j,i}}$ is non-negative, irreducible (i.e., its graph is strongly 
connected) and of period 1. As such, $(M^P)_{S_{j,i}}$ is primitive [9], which means that 
a power $C'$ of this matrix is positive. The lemma follows with $C = PC'$. Moreover, 
$C'$ is at most $d^2 - 2d + 2$ [25]. 
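
As a small illustration of the bound on $C'$, the following sketch searches for the least power of a non-negative matrix that is entrywise positive, looking no further than $d^2 - 2d + 2$. It works on the zero/non-zero pattern only, and the function name `primitivity_exponent` is a hypothetical helper introduced here, not something taken from the paper.

```python
def primitivity_exponent(A):
    """Least k >= 1 with A^k entrywise positive, searched up to d^2 - 2d + 2.

    Returns None if no such power exists within the bound, in which case the
    matrix is not primitive.
    """
    d = len(A)
    bound = d * d - 2 * d + 2
    pattern = [[a > 0 for a in row] for row in A]  # only signs matter
    power = pattern
    for k in range(1, bound + 1):
        if all(all(row) for row in power):
            return k
        # Boolean matrix product: power <- power * pattern
        power = [
            [any(power[i][m] and pattern[m][j] for m in range(d)) for j in range(d)]
            for i in range(d)
        ]
    return None
```

For a cyclic permutation matrix the function returns `None`, matching the fact that a periodic SCC has no positive power; this is why the lemma is applied to $(M^P)_{S_{j,i}}$ rather than to $M$ itself.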


Our goal is to show that within an SCC, all non-zero entries are of 
a similar magnitude due to the presence of a relatively short path (of length $C$) 
between any two states in the SCC. To do this we introduce the notion of 
closeness and observe some useful properties. 


Definition 5. We say two numbers $x, x' \in FP_{10}[p]$ are $\delta$-close, denoted by $x \approx_\delta x'$, 
if $|\mathrm{exponent}(x) - \mathrm{exponent}(x')| \le \delta$. In particular, for every $\delta \ge 0$, zero is 
assumed to be $\delta$-close only to itself. 

We extend the notion to vectors $y, y' \in FP_{10}[p]^{S}$, indexed by $S \subseteq Q$, such 
that $y \approx_\delta y'$ if all entries of the same phase are $\delta$-close to one another across 
both $y$ and $y'$, that is, for each phase $i \in \{0,\dots,P-1\}$ and all $(q,i), (q',i) \in S$: 
$y_{(q,i)} \approx_\delta y'_{(q',i)}$, $y_{(q,i)} \approx_\delta y_{(q',i)}$ and $y'_{(q,i)} \approx_\delta y'_{(q',i)}$. 


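Proposition 1. Let $x, x', x'' \in FP_{10}[p]$ be non-zero floating-point numbers. 


(1) If $x \approx_\delta x'$ then $10^{-\delta-1} \le x/x' \le 10^{\delta+1}$. 
(2) If $10^{-\delta} \le x/x' \le 10^{\delta}$ then $x \approx_{\delta+2} x'$. 
(3) If $x \approx_\delta x'$ and $x' \approx_\eta x''$ then $x \approx_{\delta+\eta+4} x''$. 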
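
Closeness is easy to experiment with using Python's `decimal` module. The sketch below assumes that the exponent of a number is the position of its most significant digit (mantissa in $[1, 10)$), which is what `Decimal.adjusted()` returns; the paper's normalisation of mantissas may differ by a constant. The helper names `exponent` and `is_close` are illustrative only.

```python
from decimal import Decimal


def exponent(x: Decimal) -> int:
    """Exponent of x in scientific notation (assumed mantissa range [1, 10))."""
    return x.adjusted()


def is_close(x: Decimal, y: Decimal, delta: int) -> bool:
    """delta-closeness: zero is close only to itself."""
    if x.is_zero() or y.is_zero():
        return x.is_zero() and y.is_zero()
    return abs(exponent(x) - exponent(y)) <= delta
```

For instance, `Decimal('3.2E5')` and `Decimal('9.9E3')` are 2-close but not 1-close, and their ratio lies between $10^{-3}$ and $10^{3}$, in line with item (1) of Proposition 1.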


Lemma 3. Let $S_j$ be a top strongly connected component of $(M, x)$, and let $C$ 
be as given by Lemma 2. 
There exists $\beta \in \mathbb{N}$ such that for all $(q,i), (q',i) \in S_j$ and every $t \ge C$: 

- if $t \not\equiv i \bmod P$, then $x^{(t)}_{(q,i)} = 0$, 

- otherwise, $x^{(t)}_{(q,i)} \approx_\beta x^{(t)}_{(q',i)}$. 
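Proof. Let $t \in \mathbb{N}$. If $t \not\equiv i \bmod P$ then $x^{(t)}_{(q,i)} = 0$ for all $(q,i) \in S_j$ by 
construction. 

Otherwise, let 
\[ m \ge \max_{q,q' \in Q \,:\, M_{q,q'} \neq 0} \max\bigl(M_{q,q'},\ (M_{q,q'})^{-1}\bigr) \] 
be a constant larger than all values occurring in $M$ and such that $\frac{1}{m}$ is smaller 
than all non-zero values appearing in $M$. Let $c$ be the constant from the log 
bounded property of the rounding function $[\cdot]$ and $d$ be the dimension of $M$. 

Observe that for all $t \in \mathbb{N}$ with $t \equiv i \bmod P$ we have 
\[
\begin{aligned}
x^{(t)}_{(q,i)} &= \Bigl[\sum_{(q',i-1)} M_{(q,i),(q',i-1)}\, x^{(t-1)}_{(q',i-1)}\Bigr] \\
&\ge \frac{1}{c} \sum_{(q',i-1)} M_{(q,i),(q',i-1)}\, x^{(t-1)}_{(q',i-1)} && \text{(by log bounded)} \\
&\ge \frac{1}{cm} \sum_{(q',i-1) \text{ s.t. } M_{(q,i),(q',i-1)} > 0} x^{(t-1)}_{(q',i-1)} && \text{(by defn of } m\text{)}
\end{aligned}
\]
In particular 
\[ x^{(t)}_{(q,i)} \ge \frac{1}{cm}\, x^{(t-1)}_{(q',i-1)} \quad \text{for all } (q',i-1) \text{ s.t. } M_{(q,i),(q',i-1)} > 0. \] 

Using induction we obtain: 
\[ x^{(t+k)}_{(q,i+k)} \ge \frac{1}{(cm)^{k-1}}\, x^{(t+1)}_{(q',i+1)} \ge \frac{1}{(cm)^{k}}\, x^{(t)}_{(q'',i)} \] 
for all $(q',i+1), (q'',i)$ such that $M^{k-1}_{(q,i+k),(q',i+1)} > 0$ and $M_{(q',i+1),(q'',i)} > 0$. 
In particular, we have $x^{(t+C)}_{(q,i)} \ge \frac{1}{(cm)^{C}}\, x^{(t)}_{(q',i)}$ for all $q'$ (since $M^{C}_{(q,i),(q',i)} > 0$ 
for all $q'$ by the previous lemma). 

On the other hand we have 
\[ x^{(t+1)}_{(q,i+1)} = \Bigl[\sum_{(q',i) \in S_j} M_{(q,i+1),(q',i)}\, x^{(t)}_{(q',i)}\Bigr] \le mcd \max_{q' \,:\, M_{(q,i+1),(q',i)} > 0} x^{(t)}_{(q',i)}. \] 

By induction we get that $x^{(t+C)}_{(q,i)} \le (mcd)^{C} \max_{(q',i) \in S_j} x^{(t)}_{(q',i)}$. Hence, for all 
$q, q' \in S_j$ we have 
\[ \frac{1}{(mc)^{C}} \max_{q''} x^{(t)}_{(q'',i)} \le x^{(t+C)}_{(q',i)} \quad \text{and} \quad x^{(t+C)}_{(q,i)} \le (mcd)^{C} \max_{q''} x^{(t)}_{(q'',i)}. \] 
Hence 
\[ \frac{x^{(t+C)}_{(q,i)}}{x^{(t+C)}_{(q',i)}} \le d^{C} (mc)^{2C}. \] 

Setting $\gamma = \lceil \log_{10} d^{C}(mc)^{2C} \rceil$, we thus have that 
$10^{-\gamma} \le x^{(t+C)}_{(q,i)} / x^{(t+C)}_{(q',i)} \le 10^{\gamma}$ for all $(q,i), (q',i) \in S_j$ and $t \in \mathbb{N}$. Then 
$x^{(t+C)}_{(q,i)}$ and $x^{(t+C)}_{(q',i)}$ are $\beta = \gamma + 2$ close by Proposition 1. 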
Lemma 4. Let $S_j$ be a top strongly connected component of $(M, x)$. Then the 
sequence $(x^{(t)}_{S_j})_{t \in \mathbb{N}}$ is effectively pseudo-periodic. 


Proof. Let $\beta$ and $C$ be as in Lemma 3. Denote by $q_1,\dots,q_m$ the states of $S_j$. We 
define the sequence $(y^{(t)})_{t \ge C}$ such that for all $t \ge C$ and $q \in S_j$, denoting 
$\rho_q = \mathrm{mantissa}(x^{(t)}_q)$ and $\alpha_q = \mathrm{exponent}(x^{(t)}_q)$, we have that $y^{(t)} = 
(\rho_{q_1}, 0, \rho_{q_2}, \alpha_{q_2} - \alpha_{q_1}, \dots, \rho_{q_m}, \alpha_{q_m} - \alpha_{q_1})$. Note that this sequence can only take 
finitely many values, as the mantissas have a precision of $p$ decimals and, by 
Lemma 3, for all $k \le m$, $\alpha_{q_k} - \alpha_{q_1} \in \{-\beta,\dots,\beta\}$. As a consequence, the sequence 
$(y^{(t)})_{t \ge C}$ takes the same value multiple times. Let $k_1$ and $k_2$ be the two distinct 
minimal integers such that $y^{(k_1)} = y^{(k_2)}$. Setting $\alpha = \alpha^{(k_2)}_{q_1} - \alpha^{(k_1)}_{q_1}$, we have that 
$x^{(k_2)} = x^{(k_1)} \cdot 10^{\alpha}$. Since $[\cdot]$ is mantissa-based, one can show by induction that for 
all $t \ge 0$, $x^{(k_2+t)} = x^{(k_1+t)} \cdot 10^{\alpha}$. Therefore the sequence $(x^{(t)}_{S_j})_{t \in \mathbb{N}}$ is effectively 
pseudo-periodic with period $T = k_2 - k_1$ and starting point $N = C + k_1$. 
Moreover, as the maximum number of different values taken by $(y^{(t)})_{t \ge C}$ is 
known, we can deduce that both $k_1$ and $k_2 - k_1$ are smaller than $10^{pm}(2\beta+1)^{m} + 1$. 


Note that the increasing rate is the same for every state of the strongly connected 
component. 
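
The pigeonhole argument of Lemma 4 can be tried out directly. The sketch below is a hedged illustration, not the paper's procedure: it assumes a strictly positive matrix and start vector, models the rounding function $[\cdot]$ by round-half-even to $p$ significant decimal digits (one rounding per entry after an exact matrix-vector product), assumes $p \le 28$, and searches for the first repeated signature of mantissas and relative exponents. All function names are hypothetical.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN
from fractions import Fraction


def rounded_step(M, x, ctx):
    """x -> [M x]: exact matrix-vector product, then one rounding per entry."""
    out = []
    for row in M:
        exact = sum((Fraction(a) * Fraction(b) for a, b in zip(row, x)), Fraction(0))
        # Dividing two exact integers in ctx rounds once, to ctx.prec digits.
        out.append(ctx.divide(Decimal(exact.numerator), Decimal(exact.denominator)))
    return out


def signature(x):
    """Mantissa digits plus exponent offsets relative to the first entry."""
    base = x[0].adjusted()
    return tuple((v.normalize().as_tuple().digits, v.adjusted() - base) for v in x)


def find_pseudo_period(M, x0, p, max_steps=10_000):
    """Return (k1, k2, alpha) with x^(k2) = 10^alpha * x^(k1), or None."""
    ctx = Context(prec=p, rounding=ROUND_HALF_EVEN)
    x = [ctx.create_decimal(v) for v in x0]
    seen = {}
    for t in range(max_steps):
        sig = signature(x)
        if sig in seen:
            k1, exp1 = seen[sig]
            return k1, t, x[0].adjusted() - exp1
        seen[sig] = (t, x[0].adjusted())
        x = rounded_step(M, x, ctx)
    return None
```

Calling `find_pseudo_period([[2, 1], [1, 3]], [1, 1], p=1)` returns a triple `(k1, k2, alpha)`; from step `k1` on, the orbit repeats every `k2 - k1` steps up to a factor of `10 ** alpha`, because decimal rounding to a fixed number of significant digits is invariant under scaling by powers of ten.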


4.3 Pseudo-periodicity within lower SCCs 


We consider a strongly connected component $S_{me}$, which is fed by at least one 
strongly connected component $F_1,\dots,F_\ell$, $\ell \ge 1$. We let $S_F = F_1 \cup \dots \cup F_\ell$ and 
assume every $F_i$ is pseudo-periodic. 

In this section we show 


Theorem 4. $S_{me}$ is effectively pseudo-periodic and the growth rate of $S_{me}$ is 
the same for all $q \in S_{me}$. 


We first observe that the difference between values in $S_{me}$ is bounded. This 
is achieved with a proof similar to that of Lemma 2 and Lemma 3 (though 
having to combine considerations of $S_{me}$ and $S_F$). 


Lemma 5. There exist $\eta, N' \in \mathbb{N}$ such that for all $(q,i), (q',i) \in S_{me}$, all 
$t \ge N'$ and all $i \in \{0,\dots,P-1\}$: 

- if $t \not\equiv i \bmod P$, then $x^{(t)}_{(q,i)} = 0$, 

- otherwise, $x^{(t)}_{(q,i)} \approx_\eta x^{(t)}_{(q',i)}$. 


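Definition 6. We say that $x^{(t)}_q$ is influenced by $S_F$ if 
\[ x^{(t)}_q = \Bigl[\sum_{q' \in S_F} M_{q,q'}\, x^{(t-1)}_{q'} + \sum_{q' \in S_{me}} M_{q,q'}\, x^{(t-1)}_{q'}\Bigr] \neq \Bigl[\sum_{q' \in S_{me}} M_{q,q'}\, x^{(t-1)}_{q'}\Bigr] \] 
and in particular $x^{(t)}_q$ is influenced by $u \in S_F$ if: 
\[ \Bigl[\sum_{q' \in S_F \cup S_{me}} M_{q,q'}\, x^{(t-1)}_{q'}\Bigr] \neq \Bigl[\sum_{q' \in S_F \cup S_{me} \setminus \{u\}} M_{q,q'}\, x^{(t-1)}_{q'}\Bigr]. \] 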
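
Influence is essentially the question of whether the feeding terms survive rounding. A minimal sketch, assuming round-half-even to $p$ significant decimal digits as the rounding function and taking the already multiplied terms $M_{q,q'} x^{(t-1)}_{q'}$ as inputs (the names `round_sum` and `is_influenced` are hypothetical):

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN
from fractions import Fraction


def round_sum(terms, ctx):
    """Round the exact sum of the terms once, to ctx.prec significant digits."""
    exact = sum((Fraction(t) for t in terms), Fraction(0))
    return ctx.divide(Decimal(exact.numerator), Decimal(exact.denominator))


def is_influenced(own_terms, feeding_terms, p):
    """True if adding the feeding terms changes the rounded value."""
    ctx = Context(prec=p, rounding=ROUND_HALF_EVEN)
    return round_sum(own_terms + feeding_terms, ctx) != round_sum(own_terms, ctx)
```

With three digits of precision, `is_influenced([1000], [1], 3)` is `False` because 1001 rounds back to 1000, whereas `is_influenced([1000], [10], 3)` is `True`. This absorption effect is why feeding components that are much smaller than $S_{me}$ can be ignored.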


We can restrict $S_F$ to the $F_i$ in $S_F$ with the maximum growth rate. Indeed, 
from some point on, any $F_i$ with non-maximal growth rate is much smaller than 
the maximal ones, and since by the proof of Lemma 5 the values within $S_{me}$ are 
close to (or greater than) the maximum value within $S_F$, this $F_i$ would not 
influence any $x^{(t)}_q$ with $q \in S_{me}$. Let $N_1$ be the point from which we can 
assume that the elements of $S_F$ are much larger than any other feeding SCCs 
and are thus the only ones potentially influencing $S_{me}$. 

Since each $F_i$ is assumed to be pseudo-periodic, we have that $S_F$ is 
pseudo-periodic. Let $T$ be the period of $S_F$, $N_2$ be the starting point and $\alpha$ be the 
growth rate of every state of $S_F$ (meaning the exponent of every state changes 
by $\alpha$ every $T$ steps starting from the $N_2$-th step). Let $N = \max\{N_1, N_2\}$, that is, the 
point from which we can assume $S_F$ is both pseudo-periodic and dominating 
non-maximal SCCs feeding $S_{me}$. 

As a direct consequence of having the same growth rate, the non-zero terms 
within Sp are close: 


Proposition 2. If a sequence of non-zero floating-point vectors $(v^{(t)})_{t \in \mathbb{N}}$ is 
pseudo-periodic with the same growth rate within a set $Q$, then there exists $\delta$ such that 
for all $q, q' \in Q$ and all $t \ge N$, $v^{(t)}_q \approx_\delta v^{(t)}_{q'}$. 


Moreover, either $S_F$ does not influence $S_{me}$, or they are close. 


Lemma 6. There exist $\beta, N \in \mathbb{N}$ such that: 
for $t \ge N$ and $(q,i) \in S_{me}$, if $x^{(t)}_{(q,i)}$ is influenced by $(q',i-1) \in S_F$, then 
$x^{(t)}_{(r,i)} \approx_\beta x^{(t)}_{(r',i)}$ for all $(r,i), (r',i) \in S_{me} \cup S_F$. 

We will show Theorem 4 through the following observation: 


Observation 1. Observe that $S_F$ either influences $S_{me}$ infinitely many times or 
finitely many times. We have two cases: 


- If $S_F$ influences $S_{me}$ infinitely often, then they are infinitely often $\beta$-close by 
Lemma 6. Then we will observe through a simultaneous version of Lemma 4 
that $S_{me}$ is pseudo-periodic. 

- If $S_F$ influences $S_{me}$ only finitely often, then clearly from some point on $S_{me}$ 
behaves like a top SCC, and thus is pseudo-periodic directly by Lemma 4. 


It will then remain to show that we can detect which of the two cases applies, 
and place a bound on the time to detect this, which will effectively reveal the 
constants of the pseudo-periodic behaviour. 

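We now present a version of Lemma 4 to observe that if $S_F$ and $S_{me}$ are 
infinitely often $\beta$-close then $S_{me}$ is pseudo-periodic: 

Lemma 7. Suppose $x^{(t)}_{S_F} \approx_\beta x^{(t)}_{S_{me}}$ for infinitely many $t$. Then there exist $t_1 < 
t_2$ such that $x^{(t_1)}_{S_F} \approx_\beta x^{(t_1)}_{S_{me}}$ and $x^{(t_2)}_{S_F} \approx_\beta x^{(t_2)}_{S_{me}}$, $x^{(t_2)}_{S_F} = 10^{\gamma} x^{(t_1)}_{S_F}$ and $x^{(t_2)}_{S_{me}} = 
10^{\gamma} x^{(t_1)}_{S_{me}}$. In particular, the sequence $(x^{(t)}_{S_{me}})_{t \in \mathbb{N}}$ is pseudo-periodic with period 
$(t_2 - t_1)$ starting from $t_1$ with growth rate of $\gamma$ in every state. 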


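Proof. At a time $t$ such that $x^{(t)}_{S_F} \approx_\beta x^{(t)}_{S_{me}}$, we denote the vectors $x^{(t)}_{S_F} \in 
FP_{10}[p]^{|S_F|}$ and $x^{(t)}_{S_{me}} \in FP_{10}[p]^{|S_{me}|}$ respectively 
\[ \bigl(m_1 10^{\gamma^{(t)}},\ m_2 10^{\gamma^{(t)} + a_2},\ \dots,\ m_{|S_F|} 10^{\gamma^{(t)} + a_{|S_F|}}\bigr) \quad \text{and} \quad \bigl(n_1 10^{\gamma^{(t)} + b_1},\ \dots,\ n_{|S_{me}|} 10^{\gamma^{(t)} + b_{|S_{me}|}}\bigr), \] 
where $m_i, n_i$ are taken from the finite set of mantissa values expressible with 
precision $p$, $\gamma^{(t)} \in \mathbb{Z}$ and $a_i, b_i \in \mathbb{Z} \cap [-\beta, \beta]$ denote the offset from $\gamma^{(t)}$. 

Let $F$ bound the number of possible values $m_i, n_i, a_i, b_i$ can take on, where 
$F \le 10^{p(|S_F| + |S_{me}|)} \cdot (2\beta + 1)^{|S_F| + |S_{me}| - 1}$. By the pigeonhole principle, after at 
most $F + 1$ times in which $x^{(t)}_{S_F} \approx_\beta x^{(t)}_{S_{me}}$ there must exist two times $t_1 < t_2$ 
where the values of $m_i, n_i, a_i, b_i$ are identical (although the value of $\gamma^{(t)}$ could be 
different), thus $x^{(t_2)}_{S_F \cup S_{me}} = 10^{\gamma} x^{(t_1)}_{S_F \cup S_{me}}$. 

Since the rounding function is mantissa-based, the system evolution from 
$x^{(t_1)}$ is equivalent to the system evolution from $x^{(t_2)} = 10^{\gamma} x^{(t_1)}$, where $\gamma$ is the 
growth rate, $\gamma = \gamma^{(t_2)} - \gamma^{(t_1)}$. 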


We can in fact decide whether $x^{(t)}_{S_F} \approx_\beta x^{(t)}_{S_{me}}$ for the last time: 


Lemma 8. Let $\beta, N$ be defined as in Lemma 6. If $t \ge N$ then it is decidable 
whether there exists $t' > t$ such that $x^{(t')}_{S_F} \approx_\beta x^{(t')}_{S_{me}}$. 


Proof Sketch (Full proof available in [19]). If we considered $S_{me}$ in isolation, 
without the effect of $S_F$, we know it would be pseudo-periodic. We can simulate 
one period of $S_{me}$ with and without the effect of $S_F$ and determine if $S_F$ 
influences $S_{me}$ within one period. If it does then they must be close at this point. If 
$S_F$ does not influence $S_{me}$ we know that $S_{me}$ will behave pseudo-periodically at 
least until $S_F$ is close to $S_{me}$ again; having established a growth rate for $S_{me}$, 
we can compare the growth rates of $S_F$ and $S_{me}$ to see if $S_{me}$ will ever be close 
to $S_F$ again in the future. 


Finally, to conclude the proof of Theorem 4, we refine Observation 1 to show 
that the period is bounded and thus the growth rates are computable: 


- either $S_F$ is $\beta$-close to $S_{me}$ infinitely often, in particular if they become close 
$F + 1$ times then by Lemma 7 it is pseudo-periodic, 

- or the system is pseudo-periodic because it behaves like a top SCC, in which 
case Lemma 4 gives effective computation of the constants. 


Which of these occurs is determined by at most $F + 1$ applications of Lemma 8. 
5 Decidability of model checking 


In this section we use the results obtained in the previous section to show that 
model checking is decidable. We use pseudo-periodicity to show that the charac- 
teristic word is eventually periodic, a case for which model checking is decidable. 


Theorem 2. Let $(M, x)$ be a non-negative linear dynamical system, let $Y_1,\dots,Y_p$ 
be semialgebraic targets and let $\varphi$ be an MSO formula using predicates over 
$Y_1,\dots,Y_p$. It is decidable whether the characteristic word under floating-point 
rounding satisfies $\varphi$. 


Consider a semialgebraic target $Y$, which can be expressed as a Boolean 
combination of polynomial inequalities over variables representing the dimensions. 
That is, $Y = \{(x_1,\dots,x_d) \mid \bigwedge_i \bigvee_j P_{ij}(x_1,\dots,x_d) \bowtie_{ij} 0\}$, where $\bowtie_{ij} \in \{\ge, >, =\}$. 

Given a linear dynamical system $(M, x)$ defining the rounded orbit $(x^{(n)})_{n \in \mathbb{N}}$, 
recall that $Z(Y) = \{n \mid x^{(n)} \in Y\}$ are the hitting times of $Y$. We claim that this 
set is semi-linear (equivalently eventually periodic) for semialgebraic $Y$. 


Definition 7. A 1-dimensional linear set, defined by a base $b \in \mathbb{N}$ and period 
$p \in \mathbb{N}$, is the set $\{x \mid \exists k \in \mathbb{N} : x = b + k \cdot p\}$. A semi-linear set is the finite union 
of a finite set $F \subseteq \mathbb{N}$ and linear sets. It can be assumed that each linear set has 
the same period. Hence a 1-dimensional semi-linear set $X$ is defined by a finite 
set $F \subseteq \mathbb{N}$ and integers $m, p, b_1,\dots,b_m \in \mathbb{N}$ such that $x \in X$ if and only if $x \in F$ 
or $x = b + k \cdot p$ for some $k \in \mathbb{N}$ and $b \in \{b_1,\dots,b_m\}$. 
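
Membership in such a set is straightforward to test. The following sketch uses a hypothetical function `in_semilinear` whose parameters mirror the finite set $F$, the bases $b_1,\dots,b_m$ and the common period $p$ of Definition 7; it is meant only to make the definition concrete.

```python
def in_semilinear(x, finite, bases, period):
    """Is x in F, or of the form b + k * period for some base b and k >= 0?"""
    if x in finite:
        return True
    if period == 0:
        return x in bases  # degenerate linear sets {b}
    return any(x >= b and (x - b) % period == 0 for b in bases)
```

For example, with `finite={0, 3}`, `bases=[10, 12]` and `period=5`, the set contains 0, 3, 10, 12, 15, 17, 20, 22 and so on, but not 7 or 11. An eventually periodic set of hitting times can be represented in exactly this form.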


Theorem 5. Let $Y$ be a semialgebraic target. Then $Z(Y)$ is a semi-linear set. 


Theorem 5 essentially completes the proof of Theorem 2. It is almost immediate 
that the characteristic word is eventually periodic (see the long version [19] for a 
formal proof) and thus the model-checking problem can be decided by checking 
$A \cap B = \emptyset$, where $A$ is an automaton representing the characteristic word and 
$B$ encodes the language of $\varphi$. 

It is standard that semi-linear sets are closed under intersection, union, and 
complementation (see [15] for a nice introduction to semi-linear sets). Thus in 
order to express the hitting times $Z(Y)$ it is sufficient to express the hitting 
times of $\{(x_1,\dots,x_d) \mid P(x_1,\dots,x_d) > 0\}$ for finitely many polynomials $P$. 
Conjunction is found by taking the intersection of the hitting times, and 
disjunction by taking the union. The hitting times of $P(x_1,\dots,x_d) \ge 0$ can be rewritten as 
the complement of the hitting times of $-P(x_1,\dots,x_d) > 0$. The hitting times 
of $P(x_1,\dots,x_d) = 0$ are the conjunction (intersection) of $P(x_1,\dots,x_d) \ge 0$ and 
$-P(x_1,\dots,x_d) \ge 0$. Thus Theorem 5 is a consequence of the following lemma. 


Lemma 9. Assume $x^{(t)} = (x^{(t)}_1,\dots,x^{(t)}_d)_{t \in \mathbb{N}}$ is a pseudo-periodic sequence with 
start point $N$, period $T$ and growth rates $\alpha_1,\dots,\alpha_d$, and $P \in \mathbb{Q}[x_1,\dots,x_d]$ a 
rational polynomial in $d$ variables.² Then $\{i \in \mathbb{N} \mid P(x^{(i)}_1,\dots,x^{(i)}_d) > 0\}$ is a 
semi-linear set. 


² Some variables may be redundant, that is, if the polynomial does not depend on all 
dimensions of $x^{(t)}$ then some of the variables may not appear in $P$. 


62 E. Lefaucheux et al. 


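Proof. First, we show that pseudo-periodicity is closed under product. Suppose 
$x_i^{(N+Tn)} = m_i 10^{\beta_i + \alpha_i n}$ and $x_j^{(N+Tn)} = m_j 10^{\beta_j + \alpha_j n}$. Observe that $x_i^{(N+Tn)} \cdot 
x_j^{(N+Tn)} = m_i \cdot 10^{\beta_i + \alpha_i n} \cdot m_j \cdot 10^{\beta_j + \alpha_j n} = m_i m_j \cdot 10^{\beta_i + \beta_j + n(\alpha_i + \alpha_j)}$. We conclude 
that the vector $(x_i \cdot x_j)^{(t)}$ is pseudo-periodic with growth rate $\alpha_i + \alpha_j$. Observe 
that the mantissa precision increases by at most 2. 

Secondly, we show that if two pseudo-periodic sequences have the same 
growth rate, then their sum is also pseudo-periodic with the same growth rate. 
Suppose $x_i^{(N+Tn)} = m_i 10^{\beta_i + \alpha n}$ and $x_j^{(N+Tn)} = m_j 10^{\beta_j + \alpha n}$. Observe that 
$(x_i + x_j)^{(N+Tn)} = m_i 10^{\beta_i + \alpha n} + m_j 10^{\beta_j + \alpha n} = (m_i + m_j \cdot 10^{\beta_j - \beta_i})\, 10^{\beta_i + \alpha n}$. 
Observe that the mantissa precision increases by at most $10^{|\beta_j - \beta_i|}$. 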

Let $P(x_1,\dots,x_d) = \sum_{i=1}^{k} c_i Z_i$, where each $Z_i$ is a product of $x_1,\dots,x_d$. Consider 
each monomial $Z_i$ occurring in $P$: since product preserves pseudo-periodicity, we 
conclude that $Z_i^{(t)}$ is pseudo-periodic. $P^{(t)}$ is thus a linear combination of these 
pseudo-periodic vectors. Note that our prior observation does not immediately imply 
that $P^{(t)}$ is pseudo-periodic, as we required taking the sum of elements with the 
same growth rate. However, from some point on, we are only interested in those 
with the maximal growth rate. 

Without loss of generality, let $Z_1,\dots,Z_r$ have the maximum growth rate, 
and $Z_{r+1},\dots,Z_k$ have strictly smaller growth rate. For every $L \in \mathbb{N}$ there 
exists $N \in \mathbb{N}$ such that for all $t \ge N$, $\mathrm{exponent}(Z_i^{(t)}) - \mathrm{exponent}(Z_j^{(t)}) \ge L$ for all 
$i \le r < j$. 

Hence there exists $N \in \mathbb{N}$ such that for all $t \ge N$, $\sum_{i=1}^{k} c_i Z_i^{(t)} > 0$ if and 
only if $\sum_{i=1}^{r} c_i Z_i^{(t)} > 0$, because $\bigl|\sum_{i=r+1}^{k} c_i Z_i^{(t)}\bigr| < 
\bigl|\sum_{i=1}^{r} c_i Z_i^{(t)}\bigr|$ from some point on. Hence $\mathrm{sign}\bigl(\sum_{i=1}^{k} c_i Z_i^{(t)}\bigr) = \mathrm{sign}\bigl(\sum_{i=1}^{r} c_i Z_i^{(t)}\bigr)$. 

Thus we restrict our attention to $\sum_{i=1}^{r} c_i Z_i^{(t)}$. Since each of the $Z_i$ for 
$i \in \{1,\dots,r\}$ has the same growth rate, we know that $\sum_{i=1}^{r} c_i Z_i^{(t)}$ is 
pseudo-periodic. Since $\mathrm{sign}\bigl(\sum_{i=1}^{r} c_i Z_i^{(t)}\bigr)$ does not depend on the exponent, only the 
periodic mantissa, we have that the sign is periodic. The hitting times for $t < N$ 
can be determined exhaustively and included in the finite set of the semi-linear 
set. 
Abstract. Bounded model checking (BMC) is an effective technique for 
hunting bugs by incrementally exploring the state space of a system. To 
reason about infinite traces through a finite structure and to ultimately 
obtain completeness, BMC incorporates loop conditions that revisit previously 
observed states. This paper focuses on developing loop conditions for BMC of 
HyperLTL, a temporal logic for hyperproperties that allows expressing important 
policies for security and consistency in concurrent systems. Loop conditions 
for HyperLTL are more complicated than 
for LTL, as different traces may loop inconsistently in unrelated moments. 
Existing BMC approaches for HyperLTL only considered linear unrollings 
without any looping capability, which precludes both finding small in- 
finite traces and obtaining a complete technique. We investigate loop 
conditions for HyperLTL BMC, for HyperLTL formulas that contain up to 
one quantifier alternation. We first present a general complete automata- 
based technique which is based on bounds of maximum unrollings. Then, 
we introduce alternative simulation-based algorithms that allow exploit- 
ing short loops effectively, generating SAT queries whose satisfiability 
guarantees the outcome of the original model checking problem. We also 
report empirical evaluation of the prototype implementation of our BMC 
techniques using Z3py. 


1 Introduction 


Hyperproperties [13] have been getting increasing attention due to their power to 
reason about important specifications such as information-flow security policies 
that require reasoning about the interrelation among different execution traces. 
HyperLTL [12] is an extension of the linear-time temporal logic LTL [81] that 
allows quantification over traces; hence, capable of describing hyperproperties. 
For example, the security policy observational determinism can be specified as 
the HyperLTL formula $\forall\pi.\forall\pi'.\ (o_\pi \leftrightarrow o_{\pi'})\ \mathcal{W}\ \neg(i_\pi \leftrightarrow i_{\pi'})$, which specifies that for 
every pair of traces $\pi$ and $\pi'$, if they agree on the secret input $i$, then their 
public output $o$ must also be observed the same (here '$\mathcal{W}$' denotes the weak 
until operator). 

Several works [14,22] have studied model checking techniques for HyperLTL 
specifications, which typically reduce this problem to LTL model checking queries 
of modified systems. More recently, [27] proposed a QBF-based algorithm for the 
direct application of bounded model checking (BMC) [11] to HyperLTL, and suc- 
cessfully provided a push-button solution to verify or falsify HyperLTL formulas 
with an arbitrary number of quantifier alternations. However, unlike the clas- 
sic BMC for LTL, which included the so-called loop conditions, the algorithm 
in [27] is limited to (non-looping) linear exploration of paths. The reason is that 
extending path exploration to include loops when dealing with multiple paths 
simultaneously is not straightforward. For example, consider the HyperLTL formula 
$\varphi_1 = \forall\pi.\exists\pi'.\ \Box(a_\pi \rightarrow b_{\pi'})$ and two Kripke structures $K_1$ and $K_2$ as follows: 


Assume trace $\pi$ ranges over $K_1$ and trace $\pi'$ ranges over $K_2$. Proving $(K_1, K_2) \not\models 
\varphi_1$ can be achieved by finding a finite counterexample (i.e., path $s_1 s_2 s_3$ from $K_1$). 
Now, consider $\varphi_2 = \forall\pi.\exists\pi'.\ \Box(a_\pi \leftrightarrow a_{\pi'})$. It is easy to see that $(K_1, K_2) \models \varphi_2$. 
However, to prove $(K_1, K_2) \models \varphi_2$, one has to show the absence of 
counterexamples in infinite paths, which is impossible with model unrolling in finite steps as 
proposed in [27]. 

In this paper, we propose efficient loop conditions for BMC of hyperproper- 
ties. First, using an automata-based method, we show that lasso-shaped traces 
are sufficient to prove infinite behaviors of traces within finite exploration. How- 
ever, this technique requires an unrolling bound that renders it impractical. In- 
stead, our efficient algorithms are based on the notion of simulation [32] between 
two systems. Simulation is an important tool in verification, as it is used for ab- 
straction, and preserves ACTL* properties [6,24]. As opposed to more complex 
properties such as language containment, simulation is a more local property 
and is easier to check. The main contribution of this paper is the introduction 
of practical algorithms that achieve the exploration of infinite paths following 
a simulation-based approach that is capable of relating the states of multiple 
models with correct successor relations. 

We present two different variants of simulation, $\mathrm{SIM}_{EA}$ and $\mathrm{SIM}_{AE}$, allowing us to 
check the satisfaction of $\exists\forall$ and $\forall\exists$ hyperproperties, respectively. These notions 
circumvent the need to boundlessly unroll traces in both structures and 
synchronize them. For $\mathrm{SIM}_{AE}$, in order to resolve non-determinism in the first model, we 
also present a third variant, where we enhance $\mathrm{SIM}_{AE}$ by using prophecy 
variables [1,7]. Prophecy variables allow us to handle cases in which $\forall\exists$ 
hyperproperties hold despite the lack of a direct simulation. With our simulation-based 
approach, one can capture infinite behaviors of traces with finite exploration 
in a simple and concise way. Furthermore, our BMC approach not only 
model-checks the systems for hyperproperties, but also does so in a way that finds 
minimal witnesses to the simulation (i.e., by partially exploring the existentially 
quantified model), which we will further demonstrate in our empirical evaluation. 
We also design algorithms that generate SAT formulas for each variant (i.e., 
$\mathrm{SIM}_{EA}$, $\mathrm{SIM}_{AE}$, and $\mathrm{SIM}_{AE}$ with prophecies), where the satisfiability of formulas 
implies the model checking outcome. We also investigate the practical cases of 
models with different sizes leading to the eight categories in Table 1. For 
example, the first row indicates the category of verifying two models of different 
sizes with the fragment that only allows $\forall\exists$ quantifiers and $\Box$ (i.e., the globally 
temporal operator); $\forall_{small}\exists_{big}$ means that the first model is relatively smaller than 
the second model, and the positive outcome ($\models \forall\exists\Box\varphi$) can be proved by our 
simulation-based technique $\mathrm{SIM}_{AE}$, while the negative outcome ($\not\models \forall\exists\Box\varphi$) can be 
easily checked using non-looping unrolling (i.e., [27]). We will show that in 
certain cases, one can verify a $\Box$ formula without exploring the entire state space 
of the big model to achieve efficiency. 

Table 1: Eight categories of HyperLTL formulas with different forms of 
quantifiers, sizes of models, and different temporal operators. 

  Case                             Simulation-based technique   Non-looping unrolling 
  $\forall_{small}\exists_{big}$   $\mathrm{SIM}_{AE}$          BMC 
  $\forall_{big}\exists_{small}$   $\mathrm{SIM}_{AE}$          BMC 
  $\exists_{small}\forall_{big}$   $\mathrm{SIM}_{EA}$          BMC 
  $\exists_{big}\forall_{small}$   $\mathrm{SIM}_{EA}$          BMC 

We have implemented our algorithms¹ using Z3py, the Z3 [15] API in Python. 
We demonstrate the efficiency of our algorithm exploring a subset of the state 
space for the larger (i.e., big) model. We evaluate the applicability and 
efficiency with cases including conformance checking for distributed protocol 
synthesis, model translation, and path planning problems. In summary, we make 
the following contributions: (1) a bounded model checking algorithm for 
hyperproperties with loop conditions, (2) three different practical algorithms: $\mathrm{SIM}_{EA}$, 
$\mathrm{SIM}_{AE}$, and $\mathrm{SIM}_{AE}$ with prophecies, and (3) a demonstration of the efficiency and 
applicability by case studies that cover all eight different categories of 
HyperLTL formulas (see Table 1). 

¹ Available at: https://github.com/TART-MSU/loop_condition_tacas23 

Related Work. Hyperproperties were first introduced by Clarkson and 
Schneider [13]. HyperLTL was introduced as a temporal logic for hyperproperties in [12]. 
The first algorithms for model checking HyperLTL were introduced in [22] 
using alternating automata. Automated reasoning about HyperLTL specifications 
has received attention in many aspects, including static verification [14,20,21,22] 
and monitoring [2,8,10,18,19,26,33]. This includes tool support, such as 
MCHyper [22,14] for model checking, EAHyper [17] and MGHyper [16] for satisfiability 
checking, and RVHyper [18] for runtime monitoring. However, the aforementioned 
tools are either limited to HyperLTL formulas without quantifier alternations, or 
require additional inputs from the user (e.g., manually added strategies [14]). 

Recently, this difficulty of alternating formulas was tackled by the bounded 
model checker HyperQB [27] using QBF solving. However, HyperQB lacks loop 
conditions to capture infinite traces early in a finite exploration. In this paper, we 
develop simulation-based algorithms to overcome this limitation. There are 
alternative approaches to reason about infinite traces, like reasoning about strategies 
to deal with $\forall\exists$ formulas [14], whose completeness can be obtained by 
generating a set of prophecy variables [7]. In this work, we capture infinite traces 
in the BMC approach using simulation. We also build an applicable prototype for 
model checking HyperLTL formulas with models that contain loops. 


2 Preliminaries 


Kripke structures. A Kripke structure $K$ is a tuple $\langle S, S^0, \delta, AP, L\rangle$, where $S$ 
is a set of states, $S^0 \subseteq S$ is a set of initial states, $\delta \subseteq S \times S$ is a total transition 
relation, and $L : S \to 2^{AP}$ is a labeling function, which labels states $s \in S$ with 
a subset of atomic propositions in $AP$ that hold in $s$. A path of $K$ is an infinite 
sequence of states $s(0)s(1)\cdots \in S^\omega$, such that $s(0) \in S^0$, and $(s(i), s(i+1)) \in \delta$, 
for all $i \ge 0$. A loop in $K$ is a finite path $s(n)s(n+1)\cdots s(\ell)$, for some $0 \le n \le \ell$, 
such that $(s(i), s(i+1)) \in \delta$, for all $n \le i < \ell$, and $(s(\ell), s(n)) \in \delta$. Note that 
$n = \ell$ indicates a self-loop on a state. A trace of $K$ is a trace $t(0)t(1)t(2)\cdots \in \Sigma^\omega$, 
such that there exists a path $s(0)s(1)\cdots \in S^\omega$ with $t(i) = L(s(i))$ for all $i \ge 0$. 
We denote by $Traces(K, s)$ the set of all traces of $K$ with paths that start in 
state $s \in S$. We use $Traces(K)$ as a shorthand for $\bigcup_{s \in S^0} Traces(K, s)$, and $\mathcal{L}(K)$ 
as the shorthand for $Traces(K)$. 


Simulation relations. Let $K_A = \langle S_A, S^0_A, \delta_A, AP_A, L_A\rangle$ and $K_B = \langle S_B, S^0_B, \delta_B, 
AP_B, L_B\rangle$ be two Kripke structures. A simulation relation $R$ from $K_A$ to $K_B$ is 
a relation $R \subseteq S_A \times S_B$ that meets the following conditions: 
1. For every $s_A \in S^0_A$ there exists $s_B \in S^0_B$ such that $(s_A, s_B) \in R$. 
2. For every $(s_A, s_B) \in R$, it holds that $L_A(s_A) = L_B(s_B)$. 
3. For every $(s_A, s_B) \in R$, for every $(s_A, s'_A) \in \delta_A$, there exists $(s_B, s'_B) \in \delta_B$ 
such that $(s'_A, s'_B) \in R$. 
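
The three conditions suggest a standard greatest-fixpoint computation: start from all label-equal pairs (condition 2), repeatedly delete pairs that violate condition 3, and finally check condition 1 on the initial states. The sketch below is an explicit-state illustration of that idea and is not the SAT-based encoding developed in this paper; the function names and the representation of a structure as plain sets and dictionaries are assumptions made for the example.

```python
def greatest_simulation(SA, deltaA, LA, SB, deltaB, LB):
    """Largest R within SA x SB satisfying conditions 2 and 3."""
    succA = {s: {t for (u, t) in deltaA if u == s} for s in SA}
    succB = {s: {t for (u, t) in deltaB if u == s} for s in SB}
    # Condition 2: related states carry the same labels.
    R = {(a, b) for a in SA for b in SB if LA[a] == LB[b]}
    changed = True
    while changed:
        changed = False
        for (a, b) in list(R):
            # Condition 3 fails if some move of a cannot be matched from b.
            if any(all((a2, b2) not in R for b2 in succB[b]) for a2 in succA[a]):
                R.discard((a, b))
                changed = True
    return R


def simulates(SA, initA, deltaA, LA, SB, initB, deltaB, LB):
    """Condition 1: every initial state of K_A is related to one of K_B."""
    R = greatest_simulation(SA, deltaA, LA, SB, deltaB, LB)
    return all(any((a, b) in R for b in initB) for a in initA)
```

Because only violating pairs are ever removed, the result is the largest relation satisfying conditions 2 and 3, so a simulation from $K_A$ to $K_B$ exists exactly when `simulates` returns `True`.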


The Temporal Logic HyperLTL. HyperLTL [12] is an extension of the linear- 
time temporal logic (LTL) for hyperproperties. The syntax of HyperLTL formulas 
is defined inductively by the following grammar: 


$\varphi ::= \exists\pi.\varphi \mid \forall\pi.\varphi \mid \psi$ 
$\psi ::= \mathit{true} \mid a_\pi \mid \neg\psi \mid \psi \vee \psi \mid \psi \wedge \psi \mid \psi\ \mathcal{U}\ \psi \mid \psi\ \mathcal{R}\ \psi \mid \bigcirc\psi$ 

where $a \in AP$ is an atomic proposition and $\pi$ is a trace variable from an infinite 
supply of variables $\mathcal{V}$. The Boolean connectives $\neg$, $\vee$, and $\wedge$ have the usual 
meaning, $\mathcal{U}$ is the temporal until operator, $\mathcal{R}$ is the temporal release operator, 
and $\bigcirc$ is the temporal next operator. We also consider other derived Boolean 
connectives, such as $\rightarrow$ and $\leftrightarrow$, and the derived temporal operators eventually 
$\Diamond\psi = \mathit{true}\ \mathcal{U}\ \psi$ and globally $\Box\psi = \neg\Diamond\neg\psi$. A formula is closed (i.e., a sentence) 
if all trace variables used in the formula are quantified. We assume, without loss 
of generality, that no trace variable is quantified twice. We use $Vars(\varphi)$ for the 
set of trace variables used in formula $\varphi$. 


Semantics. An interpretation $\mathcal{T} = \langle T_\pi \rangle_{\pi \in Vars(\varphi)}$ of a formula $\varphi$ consists of a 
tuple of sets of traces, with one set $T_\pi$ per trace variable $\pi$ in $Vars(\varphi)$, denoting 
the set of traces that $\pi$ ranges over. Note that we allow quantifiers to range over 
different models, called the multi-model semantics [23,27]². That is, each set of 
traces comes from a Kripke structure and we use $\mathcal{K} = \langle K_\pi \rangle_{\pi \in Vars(\varphi)}$ to denote 
a family of Kripke structures, so $T_\pi = Traces(K_\pi)$ is the set of traces that $\pi$ can range 
over, which comes from $K_\pi \in \mathcal{K}$. Abusing notation, we write $\mathcal{T} = Traces(\mathcal{K})$. 

The semantics of HyperLTL is defined with respect to a trace assignment, 
which is a partial map $\Pi : Vars(\varphi) \rightharpoonup \Sigma^\omega$. The assignment with the empty 
domain is denoted by $\Pi_\emptyset$. Given a trace assignment $\Pi$, a trace variable $\pi$, and 
a concrete trace $t \in \Sigma^\omega$, we denote by $\Pi[\pi \to t]$ the assignment that coincides 
with $\Pi$ everywhere but at $\pi$, which is mapped to trace $t$. The satisfaction of 
a HyperLTL formula $\varphi$ is a binary relation $\models$ that associates a formula to the 
models $(\mathcal{T}, \Pi, i)$ where $i \in \mathbb{Z}_{\ge 0}$ is a pointer that indicates the current evaluating 
position. The semantics is defined as follows: 


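$(\mathcal{T}, \Pi, 0) \models \exists\pi.\psi$ iff there is a $t \in T_\pi$ such that $(\mathcal{T}, \Pi[\pi \to t], 0) \models \psi$, 
$(\mathcal{T}, \Pi, 0) \models \forall\pi.\psi$ iff for all $t \in T_\pi$, $(\mathcal{T}, \Pi[\pi \to t], 0) \models \psi$, 
$(\mathcal{T}, \Pi, i) \models \mathit{true}$, 
$(\mathcal{T}, \Pi, i) \models a_\pi$ iff $a \in \Pi(\pi)(i)$, 
$(\mathcal{T}, \Pi, i) \models \neg\psi$ iff $(\mathcal{T}, \Pi, i) \not\models \psi$, 
$(\mathcal{T}, \Pi, i) \models \psi_1 \vee \psi_2$ iff $(\mathcal{T}, \Pi, i) \models \psi_1$ or $(\mathcal{T}, \Pi, i) \models \psi_2$, 
$(\mathcal{T}, \Pi, i) \models \psi_1 \wedge \psi_2$ iff $(\mathcal{T}, \Pi, i) \models \psi_1$ and $(\mathcal{T}, \Pi, i) \models \psi_2$, 
$(\mathcal{T}, \Pi, i) \models \bigcirc\psi$ iff $(\mathcal{T}, \Pi, i+1) \models \psi$, 
$(\mathcal{T}, \Pi, i) \models \psi_1\ \mathcal{U}\ \psi_2$ iff there is a $j \ge i$ for which $(\mathcal{T}, \Pi, j) \models \psi_2$ and 
    for all $k \in [i, j)$, $(\mathcal{T}, \Pi, k) \models \psi_1$, 
$(\mathcal{T}, \Pi, i) \models \psi_1\ \mathcal{R}\ \psi_2$ iff either for all $j \ge i$, $(\mathcal{T}, \Pi, j) \models \psi_2$, or, 
    for some $j \ge i$, $(\mathcal{T}, \Pi, j) \models \psi_1$ and 
    for all $k \in [i, j]$: $(\mathcal{T}, \Pi, k) \models \psi_2$. 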


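We say that an interpretation $\mathcal{T}$ satisfies a sentence $\varphi$, denoted by $\mathcal{T} \models \varphi$, if 
$(\mathcal{T}, \Pi_\emptyset, 0) \models \varphi$. We say that a family of Kripke structures $\mathcal{K}$ satisfies a 
sentence $\varphi$, denoted by $\mathcal{K} \models \varphi$, if $\langle Traces(K_\pi) \rangle_{\pi \in Vars(\varphi)} \models \varphi$. When the same 
Kripke structure $K$ is used for all path variables we write $K \models \varphi$. 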


Definition 1. A nondeterministic Büchi automaton (NBW) is a tuple 
$A = \langle \Sigma, Q, Q_0, \delta, F \rangle$, where $\Sigma$ is an alphabet, $Q$ is a nonempty finite set of 
states, $Q_0 \subseteq Q$ is a set of initial states, $F \subseteq Q$ is a set of accepting states, and 
$\delta \subseteq Q \times \Sigma \times Q$ is a transition relation. 


² In terms of the model checking problem, multi-model and (the conventional) 
single-model semantics where all paths are assigned traces from the same Kripke 
structure [12] are equivalent (see [23,27]). 


Given an infinite word $w = \sigma_1\sigma_2\cdots$ over $\Sigma$, a run of $A$ on $w$ is an infinite 
sequence of states $r = (q_0, q_1, \dots)$, such that $q_0 \in Q_0$, and $(q_{i-1}, \sigma_i, q_i) \in \delta$ for 
every $i > 0$. The run is accepting if $r$ visits some state in $F$ infinitely often. We 
say that $A$ accepts $w$ if there exists an accepting run of $A$ on $w$. The language of 
$A$, denoted $\mathcal{L}(A)$, is the set of all infinite words accepted by $A$. An NBW $A$ is 
called a safety NBW if all of its states are accepting. Every safety LTL formula 
$\psi$ can be translated into a safety NBW over $2^{AP}$ such that $\mathcal{L}(A)$ is the set of all 
traces over $AP$ that satisfy $\psi$ [29]. 


3 Adaptation of BMC to HyperLTL on Infinite Traces 


There are two main obstacles in extending the BMC approach of [27] to handle 
infinite traces. First, a trace may have an irregular behavior. Second, even traces 
whose behavior is regular, that is, lasso shaped, are hard to synchronize, since 
the lengths of their respective prefixes and lassos need not be equal. For the 
latter issue, synchronizing two traces whose prefixes and lassos are of lengths 
$p_1, p_2$ and $l_1, l_2$, respectively, is equivalent to coordinating the same two traces, 
when defining both their prefixes to be of length $\max\{p_1, p_2\}$, and their lassos to 
be of length $\mathrm{lcm}\{l_1, l_2\}$, where 'lcm' stands for 'least common multiple'. As for 
the former challenge, we show that restricting the exploration of traces in the 
models to only consider lasso traces is sound. That is, considering only lasso- 
shaped traces is equivalent to considering the entire trace set of the models. 
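
The synchronization argument can be illustrated with a short sketch. It assumes a lasso is represented as a pair of Python lists (prefix, loop); the helper names `unroll` and `synchronize` are hypothetical and only serve the illustration.

```python
from math import gcd


def unroll(prefix, loop, prefix_len, loop_len):
    """Rewrite prefix.loop^omega with the given (larger) prefix and loop lengths.

    Assumes prefix_len >= len(prefix) and loop_len is a multiple of len(loop),
    so the infinite word that is denoted does not change.
    """
    def letter(i):
        if i < len(prefix):
            return prefix[i]
        return loop[(i - len(prefix)) % len(loop)]

    word = [letter(i) for i in range(prefix_len + loop_len)]
    return word[:prefix_len], word[prefix_len:]


def synchronize(lasso1, lasso2):
    """Give both lassos a prefix of length max{p1, p2} and a loop of length lcm{l1, l2}."""
    (p1, l1), (p2, l2) = lasso1, lasso2
    prefix_len = max(len(p1), len(p2))
    loop_len = len(l1) * len(l2) // gcd(len(l1), len(l2))  # least common multiple
    return (unroll(p1, l1, prefix_len, loop_len),
            unroll(p2, l2, prefix_len, loop_len))


t1 = (["a"], ["b", "c"])                  # a (b c)^omega
t2 = (["x", "y", "z"], ["u", "v", "w"])   # x y z (u v w)^omega
s1, s2 = synchronize(t1, t2)
```

After synchronization both lassos have a prefix of length 3 and a loop of length 6, so position i of one trace can be compared directly with position i of the other.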

Let $K = (S, S^0, \delta, AP, L)$ be a Kripke structure. A lasso path of $K$ is a path
$s(0)s(1)\ldots s(\ell)$ such that $(s(\ell), s(n)) \in \delta$ for some $0 \le n \le \ell$. This path induces
a lasso trace (i.e., a lasso) $L(s(0)) \cdots L(s(n-1))\,(L(s(n)) \cdots L(s(\ell)))^{\omega}$. Let $\langle K_1, \ldots, K_k\rangle$
be a multi-model; we denote the set of lasso traces of $K_i$ by $C_i$ for all $1 \le i \le k$,
and we use $\mathcal{L}(C_i)$ as the shorthand for the set of lasso traces of $K_i$.


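Theorem 1. Let $\mathcal{K} = \langle K_1, \ldots, K_k\rangle$ be a multi-model, and let $\varphi = Q_1\pi_1.\cdots Q_k\pi_k.\,\psi$
be a HyperLTL formula, both over $AP$; then $\mathcal{K} \models \varphi$ iff $\langle C_1, \ldots, C_k\rangle \models \varphi$.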


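Proof. (sketch) For an LTL formula $\psi$ over $AP \times \{\pi_i\}_{i=1}^{k}$, we denote the
translation of $\psi$ to an NBW over $2^{AP \times \{\pi_i\}_{i=1}^{k}}$ by $A_\psi$ [34]. Given $\alpha = Q_1\pi_1 \cdots Q_k\pi_k$,
where $Q_i \in \{\exists, \forall\}$, we define the satisfaction of $A_\psi$ by $\mathcal{K}$ w.r.t. $\alpha$, denoted
$\mathcal{K} \models (\alpha, A_\psi)$, in the natural way: $\exists\pi_i$ corresponds to the existence of a path
assigned to $\pi_i$ in $K_i$, and dually for $\forall\pi_i$. Then, $\mathcal{K} \models (\alpha, A_\psi)$ iff the various
$k$-assignments of traces of $\mathcal{K}$ to $\{\pi_i\}_{i=1}^{k}$ according to $\alpha$ are accepted by $A_\psi$, which
holds iff $\mathcal{K} \models \varphi$.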

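For a model $K$, we denote by $K \cap_k A_\psi$ the intersection of $K$ and $A_\psi$ w.r.t.
$AP \times \{\pi_k\}$, taking the projection over $AP \times \{\pi_i\}_{i=1}^{k-1}$. Thus, $\mathcal{L}(K \cap_k A_\psi)$ is the
set of all $(k-1)$-words that have an extension (i.e., $\exists$) by a word in $\mathcal{L}(K)$ to a
$k$-word in $\mathcal{L}(A_\psi)$. Oppositely, $\mathcal{L}(\overline{K \cap_k \overline{A_\psi}})$ is the set of all $(k-1)$-words for which every
extension (i.e., $\forall$) by a word in $\mathcal{L}(K)$ is in $\mathcal{L}(A_\psi)$.

We first construct NBWs $A_2, \ldots, A_{k-1}, A_k$, such that for every $1 \le i < k$,
we have $\langle K_1, \ldots, K_i\rangle \models (\alpha_i, A_{i+1})$ iff $\mathcal{K} \models (\alpha, A_\psi)$, where $\alpha_i = Q_1\pi_1 \cdots Q_i\pi_i$.

For $i = k$, if $Q_k = \exists$, then $A_k = K_k \cap_k A_\psi$; otherwise, if $Q_k = \forall$, then
$A_k = \overline{K_k \cap_k \overline{A_\psi}}$. For $1 < i < k$, if $Q_i = \exists$, then $A_i = K_i \cap_i A_{i+1}$; otherwise, if $Q_i = \forall$,
then $A_i = \overline{K_i \cap_i \overline{A_{i+1}}}$. Then, for every $1 \le i < k$, we have $\langle K_1, \ldots, K_i\rangle \models (\alpha_i, A_{i+1})$
iff $\langle K_1, \ldots, K_k\rangle \models \varphi$.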

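We now prove by induction on $k$ that $\mathcal{K} \models \varphi$ iff $\langle C_1, \ldots, C_k\rangle \models \varphi$. For $k = 1$,
it holds that $\mathcal{K} \models \varphi$ iff $K_1 \models (Q_1\pi_1, A_2)$. If $Q_1 = \forall$, then $K_1 \models (Q_1\pi_1, A_2)$
iff $K_1 \cap \overline{A_2} = \emptyset$. If $Q_1 = \exists$, then $K_1 \models (Q_1\pi_1, A_2)$ iff $K_1 \cap A_2 \neq \emptyset$. In both
cases, a lasso witness to the non-emptiness exists. For $1 \le i < k$, we prove
that $\langle C_1, \ldots, C_i, K_{i+1}\rangle \models (\alpha_{i+1}, A_{i+2})$ iff $\langle C_1, \ldots, C_i, C_{i+1}\rangle \models (\alpha_{i+1}, A_{i+2})$. If
$Q_i = \forall$, then the first direction simply holds because $\mathcal{L}(C_{i+1}) \subseteq \mathcal{L}(K_{i+1})$. For
the second direction, every extension of $c_1, c_2, \ldots, c_i$ (i.e., lassos in $C_1, C_2, \ldots, C_i$)
by a path $\tau$ in $K_{i+1}$ is in $\mathcal{L}(A_{i+2})$. Indeed, otherwise we can extract a lasso
$c_{i+1}$ such that $c_1, c_2, \ldots, c_{i+1}$ is in $\mathcal{L}(\overline{A_{i+2}})$, a contradiction. If $Q_i = \exists$, then
$\mathcal{L}(C_{i+1}) \subseteq \mathcal{L}(K_{i+1})$ implies the second direction. For the first direction, we can
extract a lasso $c_{i+1} \in \mathcal{L}(C_{i+1})$ such that $\langle c_1, c_2, \ldots, c_i, c_{i+1}\rangle \in \mathcal{L}(A_{i+2})$.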


One can use Theorem 1 and the observations above to construct a sound
and complete BMC algorithm for both $\forall\exists$ and $\exists\forall$ hyperproperties. Indeed,
consider a multi-model $\langle K_1, K_2\rangle$, and a hyperproperty $\varphi = \forall\pi.\exists\pi'.\,\psi$. Such
a BMC algorithm would try to verify $\langle K_1, K_2\rangle \models \varphi$ directly, or try to prove
$\langle K_1, K_2\rangle \models \neg\varphi$. In both cases, a run may find a short lasso example for the
model under $\exists$ ($K_2$ in the former case and $K_1$ in the latter), leading to a shorter
run. However, in both cases, the model under $\forall$ would have to be explored to
the maximal lasso length implied by Theorem 1, which is doubly-exponential.
Therefore, this naive approach would be highly inefficient.


4 Simulation-Based BMC Algorithms for HyperLTL 


We now introduce efficient simulation-based BMC algorithms for verifying
hyperproperties of the types $\forall\pi.\exists\pi'.\,\Box Pred$ and $\exists\pi.\forall\pi'.\,\Box Pred$, where $Pred$ is a relational
predicate (a predicate over a pair of states). The key observation is that
simulation naturally induces the exploration of infinite traces without the need to
explicitly unroll the structures, and without needing to synchronize the indices
of the symbolic variables in both traces. Moreover, in some cases our algorithms
need to explore only part of the state space of a Kripke structure to give a
conclusive answer efficiently.

Let $K_P = (S_P, S_P^0, \delta_P, AP_P, L_P)$ and $K_Q = (S_Q, S_Q^0, \delta_Q, AP_Q, L_Q)$ be two
Kripke structures, and consider a hyperproperty of the form $\forall\pi.\exists\pi'.\,\Box Pred$.
Suppose that there exists a simulation from $K_P$ to $K_Q$. Then, every trace in
$K_P$ is embodied in $K_Q$. Indeed, we can show by induction that for every trace
$t_p = s_p(1)s_p(2)\ldots$ in $K_P$, there exists a trace $t_q = s_q(1)s_q(2)\ldots$ in $K_Q$, such
that $s_q(i)$ simulates $s_p(i)$ for every $i \ge 1$; therefore, $t_p$ and $t_q$ are equally
labeled. We generalize the labeling constraint in the definition of standard
simulation by requiring, given $Pred$, that if $(s_p, s_q)$ is in the simulation relation, then
$(s_p, s_q) \models Pred$. We denote this generalized simulation by $\mathrm{SIM}_{AE}$. Following
similar considerations, we now have that for every trace $t_p$ in $K_P$, there exists a
trace $t_q$ in $K_Q$ such that $(t_p, t_q) \models \Box Pred$. Therefore, the following result holds:




Lemma 1. Let $K_P$ and $K_Q$ be Kripke structures, and let $\varphi = \forall\pi.\exists\pi'.\,\Box Pred$ be
a HyperLTL formula. If there exists $\mathrm{SIM}_{AE}$ from $K_P$ to $K_Q$, then $\langle K_P, K_Q\rangle \models \varphi$.
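
As an explicit-state illustration of Lemma 1, the sketch below computes the greatest generalized simulation by a standard fixpoint refinement. It is not the SAT encoding used later; the function name `sim_ae`, the dictionary-based representation of the successor relations, and the toy structures are all assumptions made for the example. Here `pred` is applied directly to a pair of states.

```python
def sim_ae(init_p, succ_p, init_q, succ_q, pred):
    """Greatest relation R over S_P x S_Q such that every pair satisfies pred
    and every P-successor is matched by some Q-successor.

    Returns R if every initial P state is related to some initial Q state,
    and None otherwise. succ_p and succ_q map each state to its successor set.
    """
    rel = {(p, q) for p in succ_p for q in succ_q if pred(p, q)}
    changed = True
    while changed:
        changed = False
        for p, q in list(rel):
            # (p, q) is dropped if some successor of p has no matching successor of q.
            if any(all((p2, q2) not in rel for q2 in succ_q[q])
                   for p2 in succ_p[p]):
                rel.discard((p, q))
                changed = True
    ok = all(any((p0, q0) in rel for q0 in init_q) for p0 in init_p)
    return rel if ok else None


# Hypothetical toy structures with a single proposition a.
succ_p = {"p0": {"p1"}, "p1": {"p1"}}
succ_q = {"q0": {"q1", "q2"}, "q1": {"q1"}, "q2": {"q2"}}
label = {"p0": False, "p1": True, "q0": False, "q1": True, "q2": False}
same_label = lambda p, q: label[p] == label[q]

relation = sim_ae({"p0"}, succ_p, {"q0"}, succ_q, same_label)
```

In this toy instance the relation that survives pairs p0 with q0 and p1 with q1, so by Lemma 1 every trace of the first structure is matched by an equally labeled trace of the second.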


We now turn to properties of the type $\exists\pi.\forall\pi'.\,\Box Pred$. In this case, we must
find a single trace in $K_P$ that matches every trace in $K_Q$. Notice that $\mathrm{SIM}_{AE}$
(in the other direction) does not suffice, since it is not guaranteed that the
same trace in $K_P$ is used to match all traces in $K_Q$. However, according to
Theorem 1, it is guaranteed that if $\langle K_P, K_Q\rangle \models \exists\pi.\forall\pi'.\,\Box Pred$, then there
exists such a single lasso trace $t_p$ in $K_P$ as the witness of the satisfaction. We
therefore define a second notion of simulation, denoted $\mathrm{SIM}_{EA}$, as follows. Let
$t_p = s_p(1)s_p(2)\ldots s_p(n)\ldots s_p(\ell)$ be a lasso trace in $K_P$ (where $s_p(\ell)$ closes to
$s_p(n)$, that is, $(s_p(\ell), s_p(n)) \in \delta_P$). A relation $R$ from $t_p$ to $K_Q$ is considered
a $\mathrm{SIM}_{EA}$ from $t_p$ to $K_Q$ if the following holds:


1. $(s_p, s_q) \models Pred$ for every $(s_p, s_q) \in R$.

2. $(s_p(1), s_q) \in R$ for every $s_q \in S_Q^0$.

3. If $(s_p(i), s_q(i)) \in R$, then for every successor $s_q(i+1)$ of $s_q(i)$, it holds that
   $(s_p(i+1), s_q(i+1)) \in R$ (where $s_p(\ell+1)$ is defined to be $s_p(n)$).


If there exists a lasso trace $t_p$ with a $\mathrm{SIM}_{EA}$ from $t_p$ to $K_Q$, then we say that
there exists $\mathrm{SIM}_{EA}$ from $K_P$ to $K_Q$.
Notice that the third requirement in fact unrolls $K_Q$ in a way that guarantees
that for every trace $t_q$ in $K_Q$, it holds that $(t_p, t_q) \models \Box Pred$. Therefore, the
following result holds:


Lemma 2. Let $K_P$ and $K_Q$ be Kripke structures, and let $\varphi = \exists\pi.\forall\pi'.\,\Box Pred$.
If there exists a $\mathrm{SIM}_{EA}$ from $K_P$ to $K_Q$, then $\langle K_P, K_Q\rangle \models \varphi$.
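
The three requirements can be checked directly for a given lasso. The sketch below is one possible explicit-state reading, under the assumption that the lasso is already known to be a path of the first structure: it builds the least relation forced by the second and third requirements and then tests the first. The relation is kept over lasso positions rather than states, which is a simplification chosen for the example; `sim_ea_holds` and the toy adversary are hypothetical.

```python
def sim_ea_holds(lasso, loop_start, init_q, succ_q, pred):
    """Check whether a SIM_EA exists from the lasso to K_Q.

    lasso is the list of states s_p(1..l) (0-indexed here), loop_start is the
    index the last state jumps back to, and succ_q maps states to successors.
    """
    def next_pos(i):
        return i + 1 if i + 1 < len(lasso) else loop_start

    # Requirement 2: the first lasso state is related to every initial Q state.
    related = {(0, q) for q in init_q}
    frontier = list(related)
    # Requirement 3: close the relation under joint successors.
    while frontier:
        i, q = frontier.pop()
        for q2 in succ_q[q]:
            pair = (next_pos(i), q2)
            if pair not in related:
                related.add(pair)
                frontier.append(pair)
    # Requirement 1: every related pair must satisfy the predicate.
    return all(pred(lasso[i], q) for i, q in related)


differ = lambda p, q: p != q                 # agent and adversary never collide
ring = {1: {2}, 2: {3}, 3: {1}}              # adversary forced to move clockwise
lazy_ring = {1: {1, 2}, 2: {2, 3}, 3: {3, 1}}  # adversary may also wait

safe = sim_ea_holds([3, 1, 2], 0, {1}, ring, differ)
unsafe = sim_ea_holds([3, 1, 2], 0, {1}, lazy_ring, differ)
```

The same agent lasso works against the deterministic adversary but fails once the adversary may wait, since a single trace must then match every adversary trace.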


Lemmas 1 and 2 enable sound algorithms for model-checking $\forall\pi.\exists\pi'.\,\Box Pred$
and $\exists\pi.\forall\pi'.\,\Box Pred$ hyperproperties with loop conditions. To check the former,
check whether there exists $\mathrm{SIM}_{AE}$ from $K_P$ to $K_Q$; to check the latter, check
for a lasso trace $t_p$ in $K_P$ and $\mathrm{SIM}_{EA}$ from $t_p$ to $K_Q$. Based on these ideas, we
now introduce two SAT-based BMC algorithms.

For $\forall\exists$ hyperproperties, we not only check for the existence of $\mathrm{SIM}_{AE}$, but
also iteratively seek a small subset of $S_Q$ that suffices to simulate all states of
$S_P$. While finding $\mathrm{SIM}_{AE}$, as for standard simulation, is polynomial, the problem
of finding a simulation with a bounded number of $K_Q$ states is NP-complete
(see [28] for details). Searching for a small subset nonetheless allows us to
efficiently handle instances in which $K_Q$ is large. Moreover, we introduce in
Subsection 4.3 the use of prophecy variables, allowing us to overcome cases in
which the models satisfy the property but $\mathrm{SIM}_{AE}$ does not exist.

For $\exists\forall$ hyperproperties, we search for $\mathrm{SIM}_{EA}$ by seeking a lasso trace $t_p$ in
$K_P$, whose length increases with every iteration, similarly to standard BMC
techniques for LTL. Of course, in our case, $t_p$ must be matched with the states
of $K_Q$ in a way that ensures $\mathrm{SIM}_{EA}$. In the worst case, the length of $t_p$ may be
doubly-exponential in the sizes of the systems. However, as our experimental
results show, in case of satisfaction the process can terminate much sooner.

We now describe our BMC algorithms and our SAT encodings in detail. First,
we fix the unrolling depth of $K_P$ to $n$ and of $K_Q$ to $k$. To encode states of $K_P$ we
allocate a family of Boolean variables $\{x_i\}_{i=1}^{n}$. Similarly, we allocate $\{y_j\}_{j=1}^{k}$ to
represent the states of $K_Q$. Additionally, we encode the simulation relation $T$ by
creating $n \times k$ Boolean variables $\{sim_{ij}\}_{i=1,j=1}^{n,k}$ such that $sim_{ij}$ holds if and only
if $T(p_i, q_j)$. We now present the three variations of encoding: (1) EA-Simulation
($\mathrm{SIM}_{EA}$), (2) AE-Simulation ($\mathrm{SIM}_{AE}$), and (3) a special variation where we enrich
AE-Simulation with prophecies.


4.1 Encodings for EA-Simulation 


The goal of this encoding is to find a lasso path $t_p$ in $K_P$ that guarantees that
there exists $\mathrm{SIM}_{EA}$ to $K_Q$. Note that the set of states that $t_p$ uses may be much
smaller than the whole of $K_P$, while the state space of $K_Q$ must be explored
exhaustively. We force $x_0$ to be an initial state of $K_P$ and $x_{i+1}$ to follow $x_i$
for every $i$ we use, but for $K_Q$ we let the solver freely fill each $y_k$ and add
constraints (see footnote 3) for the full exploration of $K_Q$.


• All states are legal states. The solver must only search legal encodings
of states of $K_P$ and $K_Q$ (we use $K_P(x_i)$ to represent the combinations of
values that represent a legal state in $S_P$ and similarly $K_Q(y_j)$ for $S_Q$):

$$\bigwedge_{i=1}^{n} K_P(x_i) \;\wedge\; \bigwedge_{j=1}^{k} K_Q(y_j) \tag{1}$$


• Exhaustive exploration of $K_Q$. We require that two different indices $y_j$
and $y_r$ represent two different states in $K_Q$, so if $k = |K_Q|$, then all states are
represented, where $y_j \neq y_r$ captures that some bit distinguishes the states
encoded by $j$ and $r$ (note that the validity of states is implied by (1)):

$$\bigwedge_{j \neq r} \big(K_Q(y_j) \wedge K_Q(y_r)\big) \to (y_j \neq y_r) \tag{2}$$
• The initial $S_P^0$ state simulates all initial $S_Q^0$ states. State $x_0$ is an initial
state of $K_P$ and simulates all initial states of $K_Q$ (we use $I_P(x_0)$ to represent
a legal initial state in $K_P$ and $I_Q(y_j)$ for $S_Q^0$ of $K_Q$):

$$I_P(x_0) \;\wedge\; \bigwedge_{j=1}^{k} \big(I_Q(y_j) \to T(x_0, y_j)\big) \tag{3}$$


3 An alternative is to fix an enumeration of the states of $K_Q$ and force the assignment
of $y_0 \ldots$ according to this enumeration instead of constraining a symbolic encoding,
but the explanation of the symbolic algorithm above is simpler.

• Successors in $K_Q$ are simulated by successors in $K_P$. We first introduce
the following formula $succ_T(x, x')$ to capture one step of the simulation,
that is, $x'$ follows $x$ and for all $y$, if $T(x, y)$ then $x'$ simulates all successors
of $y$ (we use $\delta_Q(y, y')$ to represent that states $y$ and $y'$ are in $\delta_Q$ of $K_Q$;
similarly, for $(x, x') \in \delta_P$ of $K_P$ we use $\delta_P(x, x')$):

$$succ_T(x, x') \;\stackrel{\text{def}}{=}\; \bigwedge_{y=y_1}^{y_k} \bigwedge_{y'=y_1}^{y_k} \big(T(x, y) \wedge \delta_Q(y, y') \to T(x', y')\big)$$


We can then define that $x_{i+1}$ follows $x_i$:

$$\bigwedge_{i=1}^{n-1} \big[\delta_P(x_i, x_{i+1}) \wedge succ_T(x_i, x_{i+1})\big] \tag{4}$$

And, $x_n$ has a jump-back to a previously seen state:

$$\bigvee_{i=1}^{n} \big[\delta_P(x_n, x_i) \wedge succ_T(x_n, x_i)\big] \tag{5}$$


• Relational state predicates are fulfilled by simulation. Everything
related in the simulation fits the relational predicate, defined as a function
$Pred$ of two sets of labels (we use $L_Q(y)$ to represent the set of labels on the
$y$-encoded state in $K_Q$; similarly, $L_P(x)$ for the $x$-encoded state in $K_P$):

$$\bigwedge_{i=1}^{n} \bigwedge_{j=1}^{k} T(x_i, y_j) \to Pred\big(L_P(x_i), L_Q(y_j)\big) \tag{6}$$


We use $\varphi_{EA}^{n,k}$ for the SAT formula that results from conjoining (1)-(6) for bounds $n$
and $k$. If $\varphi_{EA}^{n,k}$ is satisfiable, then there exists $\mathrm{SIM}_{EA}$ from $K_P$ to $K_Q$.
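
The outer loop of the search can be pictured with an explicit-state stand-in for the solver. The sketch below enumerates lasso paths of increasing length n, mirroring the role of constraints (4) and (5), and hands each candidate to a caller-supplied check that plays the role of the simulation constraints. The names `lasso_paths`, `bmc_ea`, and `avoids_adversary`, as well as the toy ring model, are assumptions for illustration only; the actual algorithm delegates this search to a SAT solver.

```python
def lasso_paths(init_p, succ_p, n):
    """Yield (path, loop_start) for every lasso path with exactly n positions."""
    def extend(path):
        if len(path) == n:
            for j, state in enumerate(path):
                if state in succ_p[path[-1]]:      # jump-back edge (x_n, x_j)
                    yield list(path), j
            return
        for state in sorted(succ_p[path[-1]]):     # consecutive edge (x_i, x_{i+1})
            yield from extend(path + [state])

    for s0 in sorted(init_p):
        yield from extend([s0])


def bmc_ea(init_p, succ_p, check, max_n):
    """Increase the bound n until some lasso passes check; None if none does."""
    for n in range(1, max_n + 1):
        for path, loop_start in lasso_paths(init_p, succ_p, n):
            if check(path, loop_start):
                return n, path, loop_start
    return None


# Toy model: an agent on a ring of three cells may wait or step clockwise.
succ_p = {c: {c, (c + 1) % 3} for c in range(3)}


def avoids_adversary(path, loop_start):
    """Stand-in check: a deterministic adversary is at cell (1 + t) % 3 at time t."""
    i = 0
    for t in range(4 * len(path)):     # covers the prefix plus one joint period
        if path[i] == (1 + t) % 3:
            return False
        i = i + 1 if i + 1 < len(path) else loop_start
    return True


result = bmc_ea({0}, succ_p, avoids_adversary, max_n=4)
```

In this toy instance no lasso with one or two positions avoids the adversary, and the search stops at n = 3 with the path 0, 1, 2 closing back to its first state, which illustrates how a witness can be found well below the worst-case bound.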


4.2 Encodings for AE-Simulation 


Our goal now is to find a set of states $S'_Q \subseteq S_Q$ that is able to simulate all states in
$K_P$. Therefore, as in the previous case, the state space of $K_P$, corresponding to the
$\forall$ quantifier, will be explored exhaustively, and so $n = |K_P|$, while $k$ is the number
of states in $K_Q$, which increases in every iteration. As we have explained, this
allows finding a small subset of states in $K_Q$ which suffices to simulate all states
of $K_P$. (Note that here we guarantee soundness but not necessarily completeness,
which will be further explained in Section 4.3.)


• All states in the simulation are legal states. Again, every state guessed
in the simulation is a legal state from $K_P$ or $K_Q$:

$$\bigwedge_{i=1}^{n} K_P(x_i) \;\wedge\; \bigwedge_{j=1}^{k} K_Q(y_j) \tag{1'}$$

• $K_P$ is exhaustively explored. Every two different indices in the states of
$K_P$ are different states (see footnote 4):

$$\bigwedge_{i \neq r} \big(K_P(x_i) \wedge K_P(x_r)\big) \to (x_i \neq x_r) \tag{2'}$$
• All initial states in $K_P$ must match with some initial state in $K_Q$.
Note that, contrary to the $\exists\forall$ case, here the initial state in $K_Q$ may be
different for each initial state in $K_P$:

$$\bigwedge_{i=1}^{n} \bigvee_{j=1}^{k} I_P(x_i) \to \big(I_Q(y_j) \wedge T(x_i, y_j)\big) \tag{3'}$$


• For every pair in the simulation, each successor in $K_P$ must match
with some successor in $K_Q$. For each $(x_i, y_j)$ in the simulation, every
successor state of $x_i$ has a matching successor state of $y_j$:

$$\bigwedge_{i=1}^{n} \bigwedge_{t=1}^{n} \delta_P(x_i, x_t) \to \bigwedge_{j=1}^{k} \Big[T(x_i, y_j) \to \bigvee_{r=1}^{k} \big(\delta_Q(y_j, y_r) \wedge T(x_t, y_r)\big)\Big] \tag{4'}$$


• Relational state predicates are fulfilled. Similarly, all pairs of states in
the simulation should respect the relational $Pred$:

$$\bigwedge_{i=1}^{n} \bigwedge_{j=1}^{k} T(x_i, y_j) \to Pred\big(L_P(x_i), L_Q(y_j)\big) \tag{5'}$$


We now use $\varphi_{AE}^{n,k}$ for the SAT formula that results from conjoining (1')-(5') for
bounds $n$ and $k$. If $\varphi_{AE}^{n,k}$ is satisfiable, then there exists $\mathrm{SIM}_{AE}$ from $K_P$ to $K_Q$.
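
The iteration over the bound k can be mimicked explicitly for very small models. The sketch below tries subsets of the second structure's states in order of increasing size and, for each, refines a candidate relation restricted to that subset. It is a brute-force stand-in for the SAT query, exponential in the worst case, and every name in it (`smallest_simulating_subset`, the toy states) is hypothetical.

```python
from itertools import combinations


def smallest_simulating_subset(init_p, succ_p, init_q, succ_q, pred):
    """Return (subset, relation) for a smallest subset of Q states that
    simulates all of K_P, or None if no subset works."""

    def simulate_within(keep):
        rel = {(p, q) for p in succ_p for q in keep if pred(p, q)}
        changed = True
        while changed:
            changed = False
            for p, q in list(rel):
                # Only successors inside the chosen subset may be used.
                if any(all((p2, q2) not in rel for q2 in succ_q[q] & keep)
                       for p2 in succ_p[p]):
                    rel.discard((p, q))
                    changed = True
        ok = all(any((p0, q0) in rel for q0 in init_q & keep) for p0 in init_p)
        return rel if ok else None

    for k in range(1, len(succ_q) + 1):            # bound k grows each iteration
        for subset in combinations(sorted(succ_q), k):
            rel = simulate_within(set(subset))
            if rel is not None:
                return set(subset), rel
    return None


# Toy instance: a one-state loop must be simulated inside a three-state model.
succ_p = {"p0": {"p0"}}
succ_q = {"q0": {"q1", "q2"}, "q1": {"q0"}, "q2": {"q2"}}
always = lambda p, q: True

found = smallest_simulating_subset({"p0"}, succ_p, {"q0"}, succ_q, always)
```

No single state of the second model suffices here (the only initial state has no self-loop), so the search succeeds at k = 2 with the first two-state subset in sorted order that closes a cycle through the initial state.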


4.3 Encodings for AE-Simulation with Prophecies 


The AE-simulation encoding introduced in Section 4.2 is sound but not complete
(i.e., there are cases in which the property is satisfied, yet no simulation exists).
For example, when the system for the $\forall$ quantifier is non-deterministic, the
simulation is required to immediately match the successor of the $\exists$ path without
inspecting the future of the $\forall$ path. In this section, we incorporate prophecies
into our encodings to resolve this kind of case, which takes us one step towards
completeness. We now illustrate with the following example.


Example 1. Consider Kripke structures $K_1$ and $K_2$ from Section 1, and HyperLTL
formula $\varphi_2 = \forall\pi.\exists\pi'.\,\Box(a_\pi \leftrightarrow a_{\pi'})$. It is easy to see that the two models satisfy
$\varphi_2$, since mapping the sequence of states $(s_1 s_2 s_3)$ to $(q_1 q_2 q_4)$ and $(s_1 s_2 s_4)$ to
$(q_1 q_3 q_5)$ guarantees that the matched paths satisfy $\Box(a_\pi \leftrightarrow a_{\pi'})$. However, the
technique in Section 4.2 cannot differentiate the occurrences of $s_2$ in the two
different cases.


4 As in the previous case, we could fix an enumeration of the states of $S_P$ and fix
$x_0 \ldots$ to be the states according to the enumerations.

Fig. 1: Prophecy automaton for $\bigcirc\bigcirc a$ (left) and its composition with $K_1$ (right).


To solve this, we incorporate the notion of prophecies into our setting. Prophecies
have been proposed as a method to aid in the verification of hyperliveness [14]
(see [7] for a systematic method to construct prophecies). For simplicity, we
restrict here to prophecies expressed as safety automata. A safety prophecy over
$AP$ is a Kripke structure $U = (S, S^0, \delta, AP, L)$, such that $Traces(U) = AP^{\omega}$. The
product $K \times U$ of a Kripke structure $K$ with a prophecy $U$ preserves the language
of $K$ (since the language of $U$ is universal). Recall that in the construction of
the product, states $(s, u) \in (K \times U)$ that have incompatible labels are removed.
The direct product can be easily processed by repeatedly removing dead states,
resulting in a Kripke structure $K'$ whose language is $Traces(K') = Traces(K)$.
Note that there may be multiple states in $K'$ that correspond to different states
in $K$ for different prophecies. The prophecy-enriched Kripke structure can be
directly passed to the method in Section 4.2, so the solver can search for a $\mathrm{SIM}_{AE}$
that takes the value of the prophecy into account.
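
The product construction described above can be sketched in a few lines. The example assumes that labels are compatible when they are equal on a single proposition a, and it uses a hand-built prophecy structure whose states remember the current value of a and a guess for its next two values. Both the helper `product_with_prophecy` and the four-state structure are illustrative assumptions and are not claimed to be the models of Fig. 1.

```python
def product_with_prophecy(init_k, succ_k, label_k, init_u, succ_u, label_u):
    """Product of a Kripke structure with a prophecy, with dead states removed."""
    # Keep only pairs whose labels are compatible (here: equal).
    states = {(s, u) for s in succ_k for u in succ_u if label_k[s] == label_u[u]}
    succ = {(s, u): {(s2, u2) for s2 in succ_k[s] for u2 in succ_u[u]
                     if (s2, u2) in states}
            for (s, u) in states}
    # Repeatedly remove states without successors (wrong guesses die out).
    while True:
        dead = {x for x in states if not succ[x]}
        if not dead:
            break
        states -= dead
        succ = {x: succ[x] - dead for x in states}
    init = {(s, u) for s in init_k for u in init_u if (s, u) in states}
    return init, succ


# Prophecy states (b0, b1, b2): a is b0 now, b1 next, and b2 in two steps.
BITS = (0, 1)
U_STATES = [(b0, b1, b2) for b0 in BITS for b1 in BITS for b2 in BITS]
succ_u = {u: {(u[1], u[2], b) for b in BITS} for u in U_STATES}
label_u = {u: u[0] for u in U_STATES}

# Hypothetical structure: s1 -> s2, then a branch to s3 (a holds) or s4 (a fails).
succ_k = {"s1": {"s2"}, "s2": {"s3", "s4"}, "s3": {"s3"}, "s4": {"s4"}}
label_k = {"s1": 0, "s2": 0, "s3": 1, "s4": 0}

init, succ = product_with_prophecy({"s1"}, succ_k, label_k,
                                   set(U_STATES), succ_u, label_u)
```

After pruning, the initial state is split into two copies, one predicting that a holds in two steps and one predicting that it does not, which is exactly the extra information a step-by-step simulation needs to commit to a branch early.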


Example 2. Consider the prophecy automaton shown in Fig. 1 (left), where all
states are initial. Note that for every state, either all its successors are labeled
with $a$ (or none are), and all successors of its successors are labeled with $a$
(or none are). In other words, this structure encodes the prophecy $\bigcirc\bigcirc a$. The
product $K'_1$ of $K_1$ with the prophecy automaton $U$ for $\bigcirc\bigcirc a$ is shown in Fig. 1
(right). Our method can now show that $\langle K'_1, K_2\rangle \models \varphi_2$, since it can distinguish
the two copies of $s_1$ (one satisfies $\bigcirc\bigcirc a$ and is mapped to $(q_1 q_2 q_4)$, while the
other is mapped to $(q_1 q_3 q_5)$).


5 Implementation and Experiments 


We have implemented our algorithms using the SAT solver Z3 through its Python
API Z3Py [15]. The SAT formulas introduced in Section 4 are encoded into the
two scripts simEA.py and simAE.py, for finding simulation relations for the
$\mathrm{SIM}_{EA}$ and $\mathrm{SIM}_{AE}$ cases, respectively. We evaluate our algorithms with a set of
experiments, which includes all forms of quantifiers with different sizes of given
models, as presented earlier in Table 1. Our simulation algorithms benefit the
most in the cases of the form $\forall_{small}\exists_{big}$. When the second model is substantially
larger than the first model, $\mathrm{SIM}_{AE}$ is able to prove that a $\forall\exists$ hyperproperty
holds by exploring only a subset of the second model. In this section, besides
$\forall_{small}\exists_{big}$ cases, we also investigate multiple cases on each category in Table 1 to
demonstrate the generality and applicability of our algorithms. All case studies
are run on a MacBook Pro with Apple M1 Max chip and 64 GB of memory.


5.1 Case Studies and Empirical Evaluation 


Conformance in Scenario-based Programming. In scenario-based
programming, scenarios provide a big picture of the desired behaviors of a program,
and are often used in the context of program synthesis or code generation. A
synthesized program should obey what is specified in the given set of scenarios
to be considered correct. That is, the program conforms with the scenarios. The
conformance check between the scenarios and the synthesized program can be
specified as a $\forall\exists$-hyperproperty:

$$\varphi_{conf} = \forall\pi.\exists\pi'.\ \bigwedge_{p \in AP} \Box\,(p_\pi \leftrightarrow p_{\pi'}),$$

where $\pi$ is over the scenario model and $\pi'$ is over the synthesized program. That
is, for all possible runs in the scenarios, there must exist a run in the program,
such that their behaviors always match.

We look into the case of synthesizing an Alternating Bit Protocol (ABP) from
four given scenarios, inspired by [3]. ABP is a networking protocol that
guarantees reliable message transmission when message loss or data duplication is
possible. The protocol has two parties, sender and receiver, which can take
three different actions: send, receive, and wait. Each action also specifies which
message is currently transmitted: either a packet or acknowledgment (see [3] for
more details). The correctly synthesized protocol should not only have complete
functionality but also include all scenarios. That is, for every trace that appears
in some scenario, there must exist a corresponding trace in the synthesized
protocol. By finding $\mathrm{SIM}_{AE}$ between the scenarios and the synthesized protocols,
we can prove the conformance specified with $\varphi_{conf}$. Note that the scenarios are
often much smaller than the actual synthesized protocol, and so this case falls
in the $\forall_{small}\exists_{big}$ category in Table 1. We consider two variations: a correct and
an incorrect ABP (that cannot handle packet loss). Our algorithm successfully
identifies a $\mathrm{SIM}_{AE}$ that satisfies $\varphi_{conf}$ for the correct ABP, and returns UNSAT
for the incorrect protocol, since the packet loss scenario cannot be simulated.


Verification of Model Translation. It is often the case that in model
translation (e.g., compilation), solely reasoning about the source program does not
provide guarantees about the desirable behaviors in the target executable code.
Since program verification is expensive compared with repeatedly checking the
target, alternative approaches such as certificate translation [4] are often
preferred. Certificate translation takes as input a high-level program (source) with
a given specification, and computes a set of verification conditions (certificates)
for the low-level executable code (target) to prove that a model translation is
safe. However, this technique still requires extra effort to map the certificates
to a target language, and the size of generated certificates might explode quickly
(see [4] for details). We show that our simulation algorithm can directly show
the correctness of a model translation more efficiently by investigating the source
and target with the same formula $\varphi_{conf}$ used for ABP. That is, the specifications
from the source runs $\pi$ are always preserved in some target runs $\pi'$, which implies
a correct model translation. Since translating a model into executable code
implies adding extra instructions such as writing to registers, it also falls into the
$\forall_{small}\exists_{big}$ category in Table 1.

We investigate a program from [4] that performs matrix multiplication (MM).
When executed, the C program is translated from high-level code (C) to
low-level code RTL (Register Transfer Level), which contains extra steps to read
from/write to memories. Specifications are triples of (Pre, annot, Post), where
Pre and Post are assertions and annot is a partial function from labels to
assertions (see [4] for detailed explanations). The goal is to make sure that the
translation does not violate the original verified specification. In our framework,
instead of translating the certification, we find a simulation that satisfies $\varphi_{conf}$,
proving that the translated code also satisfies the specification. We also
investigate two variations in this case: a correct translation and an incorrect
translation. Our algorithm returns SAT (i.e., finds a correct $\mathrm{SIM}_{AE}$ simulation) in
the former case, and returns UNSAT for the latter case.


Compiler Optimization. Secure compiler optimization aims at preserving
input-output behaviors of an original implementation and a target program
after applying optimization techniques, including security policies. The
conformance between source and target programs guarantees that the optimizing
procedure does not introduce vulnerabilities such as information leakage.
Furthermore, optimization is often not uniform for the same source, because
one might compile the source to multiple different targets with different
optimization techniques. As a result, an efficient way to check the behavioral
equivalence between the source and target provides a correctness guarantee
for the compiler optimization.

    // Source program S
    L1: if (j < arr_size) {
    L2:   a := arr[0];
    L3:   b := arr[j];
    L4: } else {
    L5:   a := arr[0];
    L6:   b := arr[arr_size - 1];
    L7: }

    // Target program T
    L1: a := arr[0];
    L2: if (j < arr_size) {
    L3:   b := arr[j];
    L4: } else {
    L5:
    L6:   b := arr[arr_size - 1];
    L7: }

Fig. 2: The common branch factorization example [30].

Imposing optimization usually results in a smaller program. For instance,
common branch factorization (CBF) finds common operations in an if-then-else
structure, and moves them outside of the conditional so that such an operation
is only executed once. As a result, for these optimization techniques, checking
the conformance of the source and target falls in the $\forall_{big}\exists_{small}$ category. That
is, given two programs, source (big) and target (small), we check the following
formula:


$$\varphi_{sc} = \forall\pi.\exists\pi'.\ (in_\pi \leftrightarrow in_{\pi'}) \to \Box\,(out_\pi \leftrightarrow out_{\pi'}).$$

In this case study we investigate the strategy CBF using the example in
Figure 2 inspired by [30]. We consider two kinds of optimized programs for the
strategy: one is the correct optimization, and the other contains bugs that violate
the original behavior due to the optimization. For the correct version, our algorithm
successfully discovered a simulation relation between the source and target, and
the simulation relation returns a smaller subset of states in the second model
(i.e., $|S'_Q| < |S_Q|$). For the incorrect version, we received UNSAT.


Robust Path Planning. In robotic planning, robustness planning (RP) refers
to finding a path that is able to consistently complete a mission without being
interfered with by the uncertainty in the environment (e.g., adversaries). For
instance, in the 2-D plane in Fig. 3, an agent is trying to go from the starting
point (blue grid) to the goal position (green grid). The plane also contains
three adversaries on the three corners other than the starting point (red-framed
grids), and the adversaries move trying to catch the agent but can only move in
one direction (e.g., clockwise). This is a $\exists_{small}\forall_{big}$ setting, since the adversaries
may have several ways to cooperate and attempt to catch the agent. We formulate
this planning problem as follows:

$$\varphi_{rp} = \exists\pi.\forall\pi'.\ \Box\,(pos_\pi \not\leftrightarrow pos_{\pi'}).$$

Fig. 3: A robust path.

That is, there exists a robust path for the agent to safely reach the goal regardless
of all the ways that the adversaries could move. We consider two scenarios: one
in which there exists a way for the agent to form a robust path and one in which
there does not. Our algorithm successfully returns SAT for the case in which the
agent can form a robust path, and returns UNSAT for the case in which a robust
path is impossible to find.


Plan Synthesis. The goal of plan synthesis (PS) is to synthesize a single
comprehensive plan that can simultaneously satisfy all given small requirements,
which has wide application in planning problems. We take the well-known toy
example, wolf, goat, and cabbage (see footnote 5), as a representative case here.
The problem is as follows. A farmer needs to cross a river by boat with a wolf, a
goat, and a cabbage. However, the farmer can only bring one item with him onto
the boat each time. In addition, the wolf would eat the goat, and the goat would
eat the cabbage, if they are left unattended. The goal is to find a plan that allows
the farmer to successfully cross the river with all three items safely. A plan
requires the farmer to go back and forth with the boat with certain possible ways
to carry different items, while all small requirements (i.e., the constraints among
the items) are always satisfied. In this example, the overall plan is a big model
while the requirements form a much smaller automaton. Hence, it is a $\exists_{big}\forall_{small}$
problem that can be specified with the following formula:

$$\varphi_{ps} = \exists\pi.\forall\pi'.\ \Box\,(action_\pi \not\leftrightarrow violation_{\pi'}).$$

5 https://en.wikipedia.org/wiki/Wolf,_goat_and_cabbage_problem

Type    Quants           Cases         |S_P|  |S_Q|  Z3     Outcome     solve [s]
------  ---------------  ------------  -----  -----  -----  ----------  ---------
SIM_AE  ∀_small ∃_big    ABP           11     14     sat    |S'_Q|=11   9.37
SIM_AE  ∀_small ∃_big    ABP w/ bug    11     14     unsat  -           9.46
SIM_AE  ∀_small ∃_big    MM            27     27     sat    |S'_Q|=27   67.74
SIM_AE  ∀_small ∃_big    MM w/ bug     27     27     unsat  -           66.85
SIM_AE  ∀_big ∃_small    CBF           15     9      sat    |S'_Q|=8    3.49
SIM_AE  ∀_big ∃_small    CBF w/ bug    15     9      unsat  -           3.51
SIM_EA  ∃_small ∀_big    RP            8      9      sat    |S'_P|=5    1.09
SIM_EA  ∃_small ∀_big    RP no sol.    8      9      unsat  -           1.02
SIM_EA  ∃_big ∀_small    GCW           16     4      sat    |S'_P|=8    3.36
SIM_EA  ∃_big ∀_small    GCW no sol.   16     4      unsat  -           2.27

Table 2: Summary of our case studies. The outcomes with simulation discovered
show how our algorithms find a smaller subset for either $K_P$ or $K_Q$.


5.2 Analysis and Discussion 


The summary of our empirical evaluation is presented in Table 2. For the $\forall\exists$
cases, our algorithm successfully finds a set $|S'_Q| < |S_Q|$ that satisfies the
properties for the cases ABP and CBF. Note that case MM does not find a small
subset, since we manually add extra paddings on the first model to align the
length of both traces. We note that handling this instance without padding
requires asynchronicity, a much more difficult problem, which we leave for future
work. For the $\exists\forall$ cases, we are able to find a subset of $S_P$ which forms a single
lasso path that can simulate all runs in $S_Q$ for both cases RP and GCW. We
emphasize here that previous BMC techniques (i.e., HyperQB) cannot handle most
of the cases in Table 2 due to the lack of loop conditions.


6 Conclusion and Future Work 


We introduced efficient loop conditions for bounded model checking of fragments
of HyperLTL. We proved that considering only lasso-shaped traces is equivalent to
considering the entire trace set of the models, and proposed two simulation-based
algorithms $\mathrm{SIM}_{EA}$ and $\mathrm{SIM}_{AE}$ to realize infinite reasoning with finite exploration
for HyperLTL formulas. To handle non-determinism in the latter case, we
combine the models with prophecy automata to provide the (local) simulations with
enough information to select the right move for the inner $\exists$ path. Our algorithms
are implemented using Z3Py. We have evaluated the effectiveness and efficiency
with successful verification results for a rich set of input cases, which previous
bounded model checking approaches would fail to prove.

As for future work, we are working on exploiting general prophecy automata
(beyond safety) in order to achieve full generality for the $\forall\exists$ case. The second
direction is to handle asynchrony between the models in our algorithm. Even
though model checking asynchronous variants of HyperLTL is in general
undecidable [25,5,9], we would like to explore semi-algorithms and fragments with
decidability properties. Lastly, exploring how to handle infinite-state systems
with our framework by applying abstraction techniques is another promising
future direction.


Abstract. There are two major techniques for scaling up stateless model
checking: dynamic partial order reduction (DPOR), which only explores
executions that differ in the ordering of racy accesses, and preemption
bounding, which only explores executions containing up to k preemptions
(preemptive context-switches).

Combining these two techniques is challenging because DPOR-equivalent 
executions often contain a different number of preemptions, making it 
incorrect to cut explorations that exceed the preemption bound. To 
restore completeness, prior work has weakened the DPOR algorithm, 
which often results in the exploration of many redundant executions. 
We propose an alternative approach. Starting from an optimal DPOR algo- 
rithm, we achieve completeness by allowing some slack on the preemption- 
bound of the explored executions. We prove that the required slack does 
not exceed the number of threads of the program (minus two), and that 
this upper limit is tight. 


1 Introduction 


Stateless model checking (SMC) [12] is an effective bug-finding technique for 
concurrent programs that systematically explores all interleavings of the given 
input program. As such, it suffers from the state-space explosion problem: the 
number of possible interleavings of a program grows rapidly with the program 
size. There are two main approaches to attack this problem in the literature. 


Dynamic partial order reduction (DPOR) [11] is based on the idea that 
permutations of independent instructions in an interleaving lead to the same 
state. DPOR deems such interleavings equivalent and strives to explore only 
one representative interleaving from each equivalence class. 

Preemption bounding (PB, a.k.a. context bounding) [25] is based on the idea 
that concurrency bugs in practice can be exposed with a small number of 
preemptions [24]. Leveraging this insight, PB only explores the interleavings 
that arise with at most k preemptions (for some fixed k), thereby guaranteeing 
a partial coverage of the state space. 


Combining the two approaches is non-trivial. Simply modifying a DPOR algorithm
to discard any explored executions that exceed the desired bound k is not complete,
as executions with $\le k$ preemptions are missed. To restore completeness, Coons
et al. [10] weaken DPOR by adding extra backtracking points, but such an
approach negates any optimality properties of the underlying DPOR algorithm,
and can lead to the (redundant) exploration of multiple equivalent interleavings.

In this paper, we propose a different approach. We adapt a state-of-the-art 
optimal DPOR algorithm with polynomial memory requirements called TruSt 
[16] to support preemption-bounded search. 

We first observe that the preemption-bound definition of Coons et al. [10] 
is overly pessimistic for incomplete executions (i.e., executions where at least 
one thread is enabled) in that an incomplete execution can often be extended 
to a complete one with a smaller preemption-bound. Updating the definition to 
be more optimistic, however, does not fully resolve the issue: an intermediate 
execution that exceeds the bound might still be needed in order to reveal a 
conflicting instruction that leads to the exploration of the desired execution. 

Our solution is to allow the exploration of executions exceeding the bound, as 
long as they only exceed it by a small amount, which we call slack. For programs 
with N > 2 threads, we show that a slack value of N − 2 suffices to maintain
completeness (up to the provided bound). Unlike Coons et al. [10], our approach 
is optimal in the sense that it does not explore equivalent executions more than 
once. Although it may explore executions with larger bound than the desired 
one, we argue that these executions are useful, because they can still reveal bugs. 

We have implemented our bounding approach in GENMC [18], a state-of- 
the-art open-source stateless model checker. We show that for small preemption 
bounds (and despite the slack), bounded search can perform significantly faster 
than full search. Moreover, we experimentally confirm the literature observation 
that small bounds suffice to expose most concurrency bugs. We therefore argue 
that our combination of preemption bounding and DPOR is useful as a practical 
testing approach, which also provides certain coverage guarantees. 


2 Background 


In this section, we recall the basic DPOR approach and how prior work has tried 
to incorporate preemption-bounded search into it. Subsequently, we review the 
TruSt algorithm [16], which we later build upon to obtain our results. 


2.1 The Basics of Dynamic Partial Order Reduction 


DPOR starts by exploring one thread interleaving. In the process, it detects 
conflicting transitions, i.e., instructions that, if executed in the opposite order, 
will alter the state of the system. At each state, when an earlier transition t is in 
conflict with a possible transition t’ that can be taken by another thread in this 
state, DPOR considers the execution where t’ is fired before t. To accomplish 
this, DPOR adds the transition t’ to the backtrack set of the state immediately 
before t was fired, to be explored later. 
We illustrate DPOR by running it on the following example (Fig. 1). 


(RR+WW) 
Fig. 1. Left-to-right DPOR exploration of RR+WW


After firing the transitions (r+) and (r,) (trace @), DPOR adds transition (w1) 
to the backtrack set of the state after the firing of transition (r+), since transition 
(w1) is in conflict with transition (ry). When the initial exploration is finished 
(trace @)), DPOR backtracks to @ and considers the second exploration option, 
i.e., firing transition (w1) and thus reaching @). 

Subsequently, DPOR fires (r,) (trace @) and notices that this is in conflict 
with (w2); it then adds (w2) as an alternative exploration option for the state 
before the firing of (r,) in @. Again, DPOR finishes with the exploration where 
the read instruction reads the value 1 (trace @)) and backtracks to @). Now, (w2)
is fired (trace ©) and the algorithm continues with the remaining transition, 
leading to @. DPOR now terminates since there is no other exploration option. 

This way, DPOR manages to explore all three equivalence classes (representatives
©, ©, D) of the 6 interleavings that correspond to this program.
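The count of three classes among six interleavings can be checked by brute force. The sketch below assumes a concrete shape for RR+WW, since the figure is not reproduced here: the first thread performs two reads, of which only the second reads x, and the second thread writes 1 and then 2 to x. The names `T1`, `T2`, `interleavings`, and `outcome` are illustrative and do not come from any tool.

```python
from itertools import combinations

# Assumed shape of RR+WW (hypothetical; the original figure is not available):
#   thread 1: a := y ; b := x        thread 2: x := 1 ; x := 2
T1 = [("R", "y", "a"), ("R", "x", "b")]
T2 = [("W", "x", 1), ("W", "x", 2)]

def interleavings(t1, t2):
    """Yield every merge of t1 and t2 that preserves each thread's order."""
    n = len(t1) + len(t2)
    for slots in combinations(range(n), len(t1)):
        it1, it2 = iter(t1), iter(t2)
        yield [next(it1) if i in slots else next(it2) for i in range(n)]

def outcome(trace):
    """Run a trace under SC and return the values read, keyed by register."""
    mem, regs = {"x": 0, "y": 0}, {}
    for kind, loc, arg in trace:
        if kind == "R":
            regs[arg] = mem[loc]   # a read observes the current memory value
        else:
            mem[loc] = arg         # a write overwrites the location
    return tuple(sorted(regs.items()))

traces = list(interleavings(T1, T2))
classes = {outcome(t) for t in traces}
print(len(traces), len(classes))
```

In this small program, two interleavings are equivalent exactly when the read of x observes the same write, so the observed values can stand in for the equivalence class.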


2.2 Bounded Partial Order Reduction 


Preemption bounding (PB) [25] prunes the state space by discarding executions 
that contain more preemptions than a given constant bound k. A preemption 
occurs at index i of a sequence of events $\tau$ whenever (1) events $\tau_i$ and $\tau_{i+1}$
originate from different threads and (2) the thread of $\tau_i$ remains enabled after $\tau_i$;
in particular, $\tau_i$ is not the last event of its thread.
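The definition can be turned into a small counting routine. The sketch below is only an illustration under simplifying assumptions: a trace is represented as a list of thread identifiers (one per event), threads never block, and a thread is considered enabled until it has executed all of its events. The function name `count_preemptions` and the `thread_len` parameter are hypothetical.

```python
def count_preemptions(trace, thread_len):
    """Count preemptions in `trace`, a list of thread ids (one per event).

    thread_len[t] is the total number of events thread t executes in the
    full program; a thread stays enabled until it has executed all of them
    (assumption: threads never block).
    """
    executed = {t: 0 for t in thread_len}
    preemptions = 0
    for i, t in enumerate(trace):
        executed[t] += 1
        switches = i + 1 < len(trace) and trace[i + 1] != t   # condition (1)
        still_enabled = executed[t] < thread_len[t]           # condition (2)
        if switches and still_enabled:
            preemptions += 1
    return preemptions
```

For two threads with two events each, `count_preemptions([1, 1, 2, 2], {1: 2, 2: 2})` counts no preemption, whereas the partial trace `[1, 2]` already counts one, because thread 1 is still enabled when the switch happens.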

Combining DPOR and PB is non-trivial. Specifically, simply pruning from 
DPOR’s exploration space any trace with more than k preemptions is incorrect 
because their exploration might lead to exploring traces with up to k preemptions. 

To see this, consider the run of RR+WW with k = 0. DPOR reaches the state
where (r,,) is fired and (w1) is considered as an alternative option in the backtrack
set. Firing transition (w1) will lead to trace @), which exceeds the bound, since
there is a transition from the second thread present, while the first thread is still 
enabled. By discarding this state, the execution where b = 2 (which is equivalent 
to @) would never be considered, even though it respects the bound. 

To address this issue, Coons et al. [10] conservatively add more backtrack 
points accounting for such bound-induced dependencies. Concretely, when the 
two transitions of the first thread are fired (trace @), Coons et al. [10] add (w1)
in the backtrack set not only of the state before the firing of (ry) in @), as in 
the unmodified DPOR algorithm, but also of the initial state. Additionally, the 
initial transition from a state is always picked so that it is from the same thread
as the last fired transition, if possible. As a result, when the state with only
(w1) being fired is reached (due to the additional backtrack point), (w2) will be 
fired immediately afterwards, and eventually the interleaving that corresponds 
to the right-to-left execution of the threads will be explored. 

While this solution guarantees that no execution within the bound is lost, 
it weakens DPOR, i.e., it leads to the exploration of equivalent interleavings 
that would otherwise not be considered. In RR+WW, for k > 0, Coons et al. [10]
explore interleavings that only differ in the order of (rz) and (w1). 


2.3 TruSt: Optimal Dynamic Partial Order Reduction 


The basic DPOR algorithm described in § 2.1 does not guarantee optimality, i.e., 
that only one execution from each equivalence class will be explored. There are
several improvements of the basic algorithm, some of which achieve optimality 
(e.g., [2, 17]). Here, we follow the most recent such improvement, TruSt [16], 
which achieves optimality with polynomial memory consumption. 

TruSt represents program executions as execution graphs, a concept that 
appeared in previous works for DPOR under weak memory models [15, 17]. An 
execution graph G consists of a set of nodes G.E (a.k.a. events) representing the 
individual thread instructions executed, such as read events R and write events W, 
and three kinds of directed edges encoding the ordering between events: 


— the program order G.po, which orders events of the same thread; 
— the coherence order G.co, which orders writes to the same location; and 
— the reads-from mapping G.rf, which shows where each read is reading from. 


For an execution graph G, we define the following derived relations: 


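$$G.\mathsf{porf} = \bigl(G.\mathsf{po} \cup \{(G.\mathsf{rf}(r), r) \mid r \in G.\mathsf{R}\}\bigr)^{+} \qquad \text{(causality order)}$$
$$G.\mathsf{fr} = \{(r, w) \mid (G.\mathsf{rf}(r), w) \in G.\mathsf{co}\} \qquad \text{(reads-before)}$$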


The causality order, porf, relates two events if there is a path of program order 
or read-from dependencies between them, while fr orders a read event before 
every write that is coherence after the one read by the read. 

An execution graph is SC-consistent (sequentially consistent) if there is a 
total ordering of its events respecting po such that each read event reads from 
the immediately preceding same-location write in the total order. Equivalently, a
graph is SC-consistent if porf U co U fr is acyclic. 
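The acyclicity characterization lends itself to a direct check. The following sketch is not taken from TruSt or GenMC; it assumes a deliberately simple encoding in which events are strings, po and co are sets of ordered pairs (with co transitively closed and including the initial write), and rf maps each read to the write it reads from. All function names are illustrative.

```python
def fr_edges(rf, co):
    """reads-before: (r, w') whenever the write read by r is co-before w'."""
    return {(r, w2) for r, w in rf.items() for (w1, w2) in co if w1 == w}

def is_acyclic(nodes, edges):
    """Depth-first search for a cycle in the directed graph (nodes, edges)."""
    succ = {n: [] for n in nodes}
    for a, b in edges:
        succ[a].append(b)
    state = {n: 0 for n in nodes}  # 0 = unvisited, 1 = on stack, 2 = done

    def visit(n):
        state[n] = 1
        for m in succ[n]:
            if state[m] == 1 or (state[m] == 0 and not visit(m)):
                return False       # back edge found: there is a cycle
        state[n] = 2
        return True

    return all(state[n] != 0 or visit(n) for n in nodes)

def sc_consistent(events, po, rf, co):
    """Check that porf, co and fr together contain no cycle.

    Checking po and rf edges directly is enough, because a relation is
    acyclic exactly when its transitive closure is.
    """
    rf_edges = {(w, r) for r, w in rf.items()}
    return is_acyclic(events, po | rf_edges | co | fr_edges(rf, co))
```

As a usage example, take a single thread that writes x (`w1`) and then reads x (`r1`), with `init` the initial write. Reading from `w1` is consistent, whereas reading from `init` creates the cycle `w1 → r1 → w1` through po and fr.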

Execution graphs enable the efficient reversal of many conflicting events. If a 
write or a read event is in conflict with a previous write event, there is no need 
to backtrack to the state before the write event was added. Instead, the new event
can be directly added in the execution and either read from a co-earlier write 
in case of a read event, or be placed co-before the conflicting write in case of a 
write event. 

The only reversals where backtracking is necessary are those between a write 
event and a previously added read event: when a read event is added, it does not 
have the option to read from a write that has not yet been added. These reversals 
are referred to as backward revisits. To avoid exponential memory consumption, 
TruSt considers each exploration option eagerly when the new event is added, 
instead of maintaining backtrack sets for later exploration. In the case of backward 
revisits, TruSt removes the part of the execution that was added after the read 
event but is not in the prefix of the write event. The prefix of an event is defined 
as the set of events that precede it in the porf order. This allows the write event 
to be directly added in the execution graph. Because there is the possibility that 
many different execution graphs can lead to the same execution after a backward 
revisit, TruSt only considers the revisit if the events to be removed respect a 
maximality condition, which is defined in such a way that there will always be
exactly one such set of deleted events, achieving an optimal exploration. 


3 Bounded Optimal DPOR: Obstacles 


We discuss the two main obstacles that complicate the application of preemption- 
bounded search to a DPOR algorithm. 


3.1 Pessimistic Bound Definition 


The first problem concerns the definition of preemptions for incomplete
executions. Recall in the RR+WW example why the naive adaptation of DPOR
with preemption bound k = 0 (incorrectly) does not generate the execution 
reading b = 2. The partial trace @) is discarded because it contains at least one 
preemption according to the definition of Musuvathi et al. [23]. (Both threads 
are enabled and have executed one instruction each.) 

We argue that this trace should be deemed to have no preemptions because 
of monotonicity. Trace @) can be extended to a full trace (namely, @) that (is 
equivalent to one that) does not have any preemptions. 

We therefore modify the definition of preemptions as follows. A preemption 
occurs at index i of an event sequence $\tau$ whenever (1) events $\tau_i$ and $\tau_{i+1}$ originate
from different threads and (2) the thread of $\tau_i$ remains enabled after $\tau_i$, and has
further events in the trace $\tau_{i+1}\tau_{i+2}\ldots\tau_{|\tau|}$. According to our new definition, both
interleavings that are equivalent with @) have zero preemptions, because when
switching to another thread, the first thread has no further events in the trace.

Fig. 2. A program and its intermediate execution that TruSt must explore in order to
reach the right-to-left execution.

Our new definition satisfies monotonicity and coincides with the original on 
complete executions. We note, however, that partial executions with k preemptions 
cannot always be extended to a complete execution with k preemptions. Consider, 
for example, trace @ of RR+WW, which has no preemptions. Firing the only
remaining transition leads to trace (6), which has one preemption. A DPOR 
algorithm that employs our definition of preemptions might thus reach states that 
are bound-blocked; the current explored execution respects the bound but there 
is no final execution reachable from this state that respects the bound. In our 
experience (see §6), bound-blocked executions do not seem to have a significant 
effect on the performance of our algorithm. 
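The effect of the revised definition can be seen on small traces. The sketch below uses the same simplifying assumptions as before (a trace is a list of thread identifiers, threads never block); the function name `count_preemptions_revised` and the `thread_len` parameter are hypothetical.

```python
def count_preemptions_revised(trace, thread_len):
    """Count preemptions under the revised definition.

    A switch away from thread t counts only if t is still enabled AND t
    has further events later in this same trace.
    """
    executed = {t: 0 for t in thread_len}
    preemptions = 0
    for i, t in enumerate(trace):
        executed[t] += 1
        switches = i + 1 < len(trace) and trace[i + 1] != t
        still_enabled = executed[t] < thread_len[t]
        runs_again = t in trace[i + 1:]      # the extra condition
        if switches and still_enabled and runs_again:
            preemptions += 1
    return preemptions
```

With two threads of two events each, the partial trace `[1, 2]` now has no preemption, and neither does `[1, 2, 2]`. Extending the latter with the remaining event of thread 1 gives `[1, 2, 2, 1]`, which has one preemption; this is the bound-blocked situation described above. On complete traces the extra condition is implied by enabledness, so the count agrees with the original definition.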


3.2 Need For Slack 


Monotonicity alone is not enough to incorporate bounded search in an algorithm 
like TruSt, without still forfeiting completeness: some executions that respect the 
bound might still be lost. Intuitively, since DPOR algorithms operate by detecting 
conflicting instructions during an interleaving’s exploration and reversing the 
conflict to obtain a new interleaving, it might be the case that for the conflict to 
be revealed, an execution that exceeds the bound needs to be explored. 

We illustrate this point with the example in Fig. 2 where all the variables 
are initialized to zero. Consider a run of TruSt that always adds the next event 
from the left-most enabled thread. To reach the final execution that results 
from executing the threads from right to left, TruSt needs to pass through the 
execution depicted on the right of Fig. 2 before reaching this final execution. In 
the next step, the second write of the third thread will be added, which will 
reveal a conflict with the first read of y of the second thread. The algorithm will 
then perform a backward revisit, removing the events of the second thread after 
the first read of y, and change the read’s incoming rf edge to the new write 
event. The desired final execution will be reached after the remaining events of 
the second thread are added again. 
It is easy to see that, while the final execution has zero preemptions, the
depicted intermediate execution has at least one preemption, and would thus 
be discarded. This example can in fact be generalized by adding more threads 
identical to the third one; to reach the final right-to-left execution that has zero 
preemptions, TruSt must visit an execution that has at least N − 2 preemptions,
where N is the total number of threads. In §4, we show that this is in fact an
upper limit; a final execution with k preemptions is always reachable through
a sequence of executions that never exceed k + N − 2 preemptions. This result
directly enables us to incorporate preemption-bounded search into TruSt by 
allowing some slack to the bound. 
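As a worked instance of the bound, take the three-thread program of Fig. 2 with a user bound of k = 0. Writing G for any execution on the way to the final right-to-left execution, the slacked test admits exactly one extra preemption:

```latex
\mathsf{preemptions}(G) \;\le\; k + N - 2 \;=\; 0 + 3 - 2 \;=\; 1 .
```

This matches the intermediate execution discussed above, which has at least one preemption and must nevertheless be explored.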


4 Recovering Completeness via Slack 


Our bounded DPOR algorithm, BUSTER, can be seen in Algorithm 1, where we 
have highlighted the differences w.r.t. TruSt [16].

We first discuss some additional notation used in the algorithm. First, each
execution graph generated by the algorithm keeps track of the order $<_G$ in which
events were added to it. Second, given a graph G and a set of events E, we write
$G|_E$ for the restriction of G to E. Third, let G.cprefix(e) be the causal prefix of
an event e in an execution graph G, i.e., the set of all events that causally precede
it (including e itself). Formally, $G.\mathsf{cprefix}(e) \triangleq \{e' \mid (e', e) \in G.\mathsf{porf}^{*}\}$. Fourth, a
subscript loc(a) restricts a set of events to those that access the same location
as event a. Fifth, the function SetRF(G, a, w) adds an rf edge from w to a and
SetCO(G, $w_p$, a) places a immediately after $w_p$ in co. Finally, we define the traces
of an execution graph as the linearizations of (G.porf ∪ G.co ∪ G.fr) on G.E.
We lift the definition of preemptions to an execution graph G: preemptions(G)
is the minimum number of preemptions in the traces of G.
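For very small graphs, preemptions(G) can be computed literally from this definition by enumerating linearizations. The sketch below does so by brute force; it is exponential and is not the procedure used in the implementation. It assumes that threads never block, that `order` contains the po, rf, co and fr pairs of the graph, and that the example graphs follow the assumed shape of RR+WW (two reads `ra`, `rb` in thread 1, two writes `w1`, `w2` in thread 2). All names are illustrative.

```python
from itertools import permutations

def count_preemptions(trace, thread_of, thread_len):
    """Preemptions of one trace (revised definition, non-blocking threads)."""
    ids = [thread_of[e] for e in trace]
    done = {t: 0 for t in thread_len}
    total = 0
    for i, t in enumerate(ids):
        done[t] += 1
        if (i + 1 < len(ids) and ids[i + 1] != t
                and done[t] < thread_len[t] and t in ids[i + 1:]):
            total += 1
    return total

def graph_preemptions(events, thread_of, thread_len, order):
    """Minimum preemption count over all linearizations of `order`."""
    best = None
    for trace in permutations(events):
        pos = {e: i for i, e in enumerate(trace)}
        if all(pos[a] < pos[b] for a, b in order):   # trace respects order
            n = count_preemptions(trace, thread_of, thread_len)
            best = n if best is None else min(best, n)
    return best

events = ["ra", "rb", "w1", "w2"]
thread_of = {"ra": 1, "rb": 1, "w1": 2, "w2": 2}
thread_len = {1: 2, 2: 2}
po = {("ra", "rb"), ("w1", "w2")}
# rb reads from w2: rf edge (w2, rb)
reads_w2 = po | {("w2", "rb")}
# rb reads from w1: rf edge (w1, rb) and fr edge (rb, w2)
reads_w1 = po | {("w1", "rb"), ("rb", "w2")}
```

Under these assumptions, the graph in which the second read observes the last write needs no preemption (run thread 2 first), while the graph in which it observes the first write needs one.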

Apart from only exploring SC-consistent executions, BUSTER eagerly discards 
executions with more preemptions than the user-provided value k plus the slack 
(Line 5). If both tests fail, BUSTER continues by picking a new event to extend
the current execution (Line 6). For correctness, we fix $\mathsf{next}_P(G)$ to always return
the event that corresponds to the left-most available thread. Depending on the 
type of the new event, the algorithm proceeds in a different way. We discuss the 
interesting cases of read and write events. 

If the new event a is a read event, BUSTER simply considers every possible 
write event as an rf option for a (Line 13), and eagerly explores the corresponding 
execution. If a is a write event, first every co placement is considered and explored 
(Line 15). Afterwards, BUSTER considers possible backward revisits; for every
read event r that is not in the causal prefix of a, the execution where r reads
from a is considered, after deleting the events added after r that are not in the
causal prefix of a (Line 19). To avoid redundant revisits, the backward revisit is
performed only when the set of deleted events satisfies a maximality condition
(Line 18; see [16] for more details).
Algorithm 1 A Bounded DPOR algorithm based on TruSt [16]

 1: procedure VERIFY(P, k)
 2:   VISIT_{P,k}(G_0)
 3: procedure VISIT_{P,k}(G)
 4:   if ¬consistent(G) then return
 5:   if preemptions(G) > k + N − 2 then return
 6:   switch a ← next_P(G) do
 7:     case a = ⊥
 8:       return “Visited full execution graph G”
 9:     case a ∈ error
10:       exit(“Visited erroneous execution graph G”)
11:     case a ∈ R
12:       for w ∈ G.W_loc(a) do
13:         VISIT_{P,k}(SetRF(G, a, w))
14:     case a ∈ W
15:       VISITCOS_{P,k}(G, a)
16:       for r ∈ G.R_loc(a) \ G.cprefix(a) do
17:         Deleted ← {e ∈ G.E | r <_G e} \ G.cprefix(a)
18:         if ∀e ∈ Deleted ∪ {r}. ISMAXIMALLYADDED(G, e, a) then
19:           VISITCOS_{P,k}(SetRF(G|_{G.E \ Deleted}, r, a), a)
20:     case _
21:       VISIT_{P,k}(G)
22: procedure VISITCOS_{P,k}(G, a)
23:   for w_p ∈ G.W_loc(a) do VISIT_{P,k}(SetCO(G, w_p, a))


4.1 Properties of TruSt 


We now present some key properties of the TruSt algorithm, i.e., Algorithm 1
without Line 5, that are used to prove BUSTER’s correctness (Theorem 1).

From TruSt’s correctness argument, we know that every SC-consistent
execution $G_f$ has exactly one sequence of VISIT calls that leads to it. We call the
sequence of the corresponding graphs a production sequence for $G_f$.

Given two SC-consistent graphs G and G', we say that G is a prefix of G', and
write $G \sqsubseteq G'$, if $G'|_{G.E} = G$. Intuitively, G is a prefix of G' if we can construct
G' from G, by adding the missing events in some order for some rf and co.

Let a maximal step of an execution G be an execution that results from
extending a thread of G by an event e in a maximal way, i.e., if e ∈ R, then e is
made to read from the co-latest write and if e ∈ W, then e is placed at the end
of co. We write $G \to G'$ when G' is a maximal step of G, and $G \to_e G'$ when
$G \to G'$ and e is the added event. We say that a sequence of maximal steps is
non-decreasing when the sequence of the thread identifiers of the added events is
non-decreasing. Finally, we write tid(e) for the thread identifier of an event e.
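A maximal step is simple to express operationally. The sketch below assumes a toy graph encoding that is not the one used by TruSt or GenMC: a dictionary with an event list, an rf map from reads to writes, and, per location, the coherence order as a list with the oldest write first (starting with an initial write). The names `maximal_step` and `is_non_decreasing` are hypothetical.

```python
def maximal_step(graph, event):
    """Return a copy of `graph` extended with `event` in the maximal way.

    graph = {"events": [...], "rf": {read: write},
             "co": {loc: [writes, oldest first]}}
    event = (thread_id, kind, loc, name) with kind "R" or "W".
    """
    tid, kind, loc, name = event
    new = {"events": graph["events"] + [event],
           "rf": dict(graph["rf"]),
           "co": {l: list(ws) for l, ws in graph["co"].items()}}
    if kind == "R":
        new["rf"][name] = new["co"][loc][-1]         # read the co-latest write
    else:
        new["co"].setdefault(loc, []).append(name)   # place at the end of co
    return new

def is_non_decreasing(steps):
    """True if the thread ids of the added events never decrease."""
    tids = [e[0] for e in steps]
    return all(a <= b for a, b in zip(tids, tids[1:]))
```

For example, starting from a graph whose only write to x is `init`, adding a write `w1` and then a read `r1` by maximal steps makes `r1` read from `w1`.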
A key property of TruSt (stated in Prop. 1) is that every execution G in the
production sequence of an SC-consistent execution $G_f$ is either a prefix of $G_f$, or
it contains a read event r that does not read from the “correct” write, but there
is a prefix $\hat{G}$ of $G_f$ that can be extended to G by a non-decreasing sequence of
maximal steps starting with r and not including events of at least one thread to
the right of r.


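Proposition 1. Let S be the production sequence of an SC-consistent final
execution $G_f$, and G be an execution in S. Then, either $G \sqsubseteq G_f$ or there exists
an execution $G_r$ that is before G in S, a read event $r = \mathsf{next}_P(G_r)$, a
thread $t > \mathsf{tid}(r)$ and an execution $\hat{G}$ such that
$G_r \sqsubseteq \hat{G} \sqsubseteq G_f|_{G_r.\mathsf{E} \,\cup\, G_f.\mathsf{cprefix}(r)}$,
$G_f|_{G_f.\mathsf{cprefix}(G_f.\mathsf{rf}(r))} \not\sqsubseteq G$, there is a non-decreasing sequence of maximal steps s.t.
$\hat{G} \to_r \to^{*} G$, and $\forall e \in G.\mathsf{E} \setminus \hat{G}.\mathsf{E}.\ \mathsf{tid}(e) \neq t$.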


Intuitively, TruSt tries to construct $G_f$ by exploring an increasing sequence of
its prefixes. This is not always possible, because when a read event r is added to
$G_r$, the write event w that it should read from might not yet be present in $G_r$. In
that case, r is made to read from another write and is later revisited by w, leading
to the execution $G'_r = G_f|_{G_r.\mathsf{E} \,\cup\, G_f.\mathsf{cprefix}(r)}$, which is a prefix of $G_f$. It is possible
that additional backward revisit steps may happen between $G_r$ and $G'_r$. Due
to maximality, however, for every intermediate execution G in the production
sequence between $G_r$ and $G'_r$, there will be an execution $G_r \sqsubseteq \hat{G} \sqsubseteq G'_r$ that can
be extended to G by a sequence of non-decreasing maximal steps. Execution $\hat{G}$ is
exactly the part of G that is not deleted or revisited in a later step in S. Hence,
if w is the first write that performed a backward revisit in S after G, then the
events of thread t = tid(w) are already included in $\hat{G}$. Finally, it can be shown
that t is to the right of r. The formal proof of this proposition can be found in
the extended version of this paper [22].


4.2 Correctness of Slacked Bounding 


To see why executions in the production sequence of a graph $G_f$ can have at most
preemptions($G_f$) + N − 2 preemptions, we start with a definition. A witness of
a graph G is a trace of G that contains preemptions(G) preemptions.

Next, we observe that preemptions are monotone w.r.t. execution prefixes.
That is, if an execution G requires a certain number of preemptions to be
produced, a larger execution $G' \sqsupseteq G$ requires at least that many preemptions.


Lemma 1. If G, G' are SC-consistent and $G \sqsubseteq G'$, then preemptions(G) ≤
preemptions(G').


To prove this, take a witness of G' and restrict it to the events of G, thereby
obtaining a trace of G. The restriction can only remove preemptions, so this trace
has at most preemptions(G') preemptions.

Further, we note that the number of preemptions of an execution is unaffected 
if we extend its last executed thread with a maximal step; if a maximal step adds 
an event to a different thread, the number is increased by at most one. 
Lemma 2. Let G and G' be SC-consistent executions and r ∈ G'.E such that
$G \to_r \to^{*} G'$. Then, preemptions(G') ≤ preemptions(G) + S, where S is the
number of threads that were extended to obtain G' from G.


Proof. Consider a witness w of G and extend it by appending the missing events in
the same order they were added in the sequence of maximal steps. Notice that,
by construction of the maximal step, the resulting sequence is a trace of G'.
Each time we add an event e to the trace such that the last event of the trace
was not in the thread of e, we increase the number of preemptions by at most one:
a thread that was previously considered as completed is now extended with a new event.
However, this can only happen S times: the maximal steps keep adding events
of the same thread and when another thread is picked, the first is not extended
again (the maximal steps are non-decreasing). This gives us a trace of G' with at
most preemptions(G) + S preemptions, which concludes our proof.


We can now prove that BUSTER is complete, i.e., it visits every full, SC- 
consistent execution that respects the bound. 


Theorem 1. VERIFY(P, k) visits every full, SC-consistent execution $G_f$ of P
with preemptions($G_f$) ≤ k.


Proof. Consider a full, SC-consistent execution $G_f$ of P with at most k
preemptions. From the completeness of TruSt, we know that a run of Algorithm 1
without the test on Line 5 will visit $G_f$. It thus suffices to show that every
execution G in the production sequence of $G_f$ has at most k + N − 2
preemptions, where N is the number of threads of P. If $G \sqsubseteq G_f$, then from Lemma 1
preemptions(G) ≤ preemptions($G_f$) ≤ k.

Otherwise, from Prop. 1, there exists an execution $G_r$ that is before G
in the production sequence of $G_f$ and an execution $\hat{G}$, such that
$G_r \sqsubseteq \hat{G} \sqsubseteq G_f|_{G_r.\mathsf{E} \,\cup\, G_f.\mathsf{cprefix}(r)}$, $\mathsf{next}_P(G_r) = r \in \mathsf{R}$,
$G_f|_{G_f.\mathsf{cprefix}(G_f.\mathsf{rf}(r))} \not\sqsubseteq G$, $\hat{G} \to_r \to^{*} G$,
and no events in $G.\mathsf{E} \setminus \hat{G}.\mathsf{E}$ are in thread t, for some thread t to the right of r.

From the last two properties and Lemma 2 we have preemptions(G) ≤
k + N − 1, since preemptions($\hat{G}$) ≤ preemptions($G_f$) ($\hat{G} \sqsubseteq G_f$ and Lemma 1)
and at most N − 1 threads are extended from $\hat{G}$ to G.

To complete the proof, we will prove that preemptions(G) = k + N − 1
leads to a contradiction. The equality implies that $\hat{G}$ had k preemptions and that
N − 1 threads were extended in the maximal steps from $\hat{G}$ to G, and all of them
increased the preemptions by one. The sequence of maximal steps from $\hat{G}$ to
G is non-decreasing and starts with the thread of r. Since there are at most
N threads, N − 1 are extended, and at least one thread to the right of r (namely t)
is not extended, r is in the leftmost thread.

Let $t_r$ be the leftmost thread, $G'_r = G_f|_{G_r.\mathsf{E} \,\cup\, G_f.\mathsf{cprefix}(r)}$, and $w = G_f.\mathsf{rf}(r)$.
From the proof of TruSt, we can infer that all events of $G_r$ are in the porf-prefix
of the last event of $t_r$. It is $G_f|_{G_f.\mathsf{cprefix}(w)} \not\sqsubseteq G_r$: the opposite, together with
$G_r \sqsubseteq \hat{G} \sqsubseteq G$, contradicts $G_f|_{G_f.\mathsf{cprefix}(w)} \not\sqsubseteq G$. Since $G_r$ is in the production
sequence of $G_f$, $G_r \sqsubseteq G_f$, $\mathsf{next}_P(G_r) = r$, and $G_f|_{G_f.\mathsf{cprefix}(w)} \not\sqsubseteq G_r$, TruSt will
eventually add the write $w = G_f.\mathsf{rf}(r)$ and revisit the read r, reaching the
execution $G'_r \sqsubseteq G_f$ that contains all events added before r, i.e., the events of $G_r$,
the events in the porf-prefix of r, and r. Hence, all events in $G'_r.\mathsf{E} \setminus \{r\}$ are in
the porf-prefix of r, which implies that any witness of $G'_r$ ends with r.

Since $G'_r \sqsubseteq G_f$, any witness t of $G'_r$ has at most k preemptions. Let G'' be the
execution $G'_r$ without r, and $\hat{G}'$ the unique execution s.t. $\hat{G} \to_r \hat{G}'$. Removing
the last event r from t gives us a trace t' of G'' with at most k preemptions. If t'
ends with an event of $t_r$, then we can restrict t' to the events of $\hat{G}$ and add r at
the end, obtaining a trace of $\hat{G}'$ with at most k preemptions. Otherwise, t' does
not end with an event of $t_r$, and thus trace t has one more preemption than t', i.e.,
t' has at most k − 1 preemptions. Then, we can again restrict t' to the events of
$\hat{G}$ and add r at the end, obtaining again a trace of $\hat{G}'$ with at most k preemptions.
This contradicts our assumption that preemptions($\hat{G}$) = k and all N − 1 threads
that are extended from $\hat{G}$ increase the number of preemptions, since the first
thread $t_r$ can be extended without incurring any more preemptions.

BUSTER inherits TruSt’s optimality, as it only explores a subset of the execu- 
tions that TruSt does. Here, optimality refers to avoiding redundant work; due 
to the slack, VERIFY(P, k) can also visit executions with more than k preemptions.


Theorem 2. VERIFY(P, k) explores each graph G of a program P at most once. 


5 Implementation 


We have implemented BUSTER on top of the GENMC tool [18], which implements 
the TruSt algorithm [16]. Since GENMC supports weak memory models and 
the standard notion of preemption bounding only makes sense for sequential 
consistency, we enforce SC in our benchmarks by using only SC memory accesses 
and selecting GENMC’s RC11 model [20]. 

The bulk of our modifications to GENMC concern checking whether
the preemption-bound of an execution G exceeds a value k. Generally, deciding
whether the preemption-bound of a Mazurkiewicz trace exceeds a value is an
NP-complete problem [23]. We use an adaptation of the bound computation in
Musuvathi et al. [23] to execution graphs, but instead of recursively computing
preemptions(G) (and caching computations across calls to amortize the cost), we
recursively compute the predicate preemptions(G) ≤ k. The benefit
of this method is that we can avoid calculating preemptions(G) exactly when
its value exceeds the desired bound. Furthermore, there is no additional state
that needs to be stored; BUSTER remains stateless.
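The benefit of deciding the predicate rather than computing the exact minimum can be illustrated with a small search that abandons a branch as soon as it exceeds k. The sketch below is a plain depth-first search over linearizations and is not the procedure of Musuvathi et al. [23] or the one in the implementation; it assumes non-blocking threads and an `order` relation containing the po, rf, co and fr pairs of the graph. The function name `within_bound` is hypothetical.

```python
def within_bound(events, thread_of, order, k):
    """Decide preemptions(G) <= k without computing the exact minimum."""
    preds = {e: {a for a, b in order if b == e} for e in events}
    left = {}                       # unscheduled events per thread
    for e in events:
        left[thread_of[e]] = left.get(thread_of[e], 0) + 1

    def search(done, last, used):
        if len(done) == len(events):
            return True             # a full trace within the bound exists
        for e in events:
            if e in done or not preds[e] <= done:
                continue            # already scheduled, or not yet ready
            t = thread_of[e]
            cost = used
            # Switching away from a thread that still has events in the
            # graph is a preemption: that thread will run again later.
            if last is not None and t != last and left[last] > 0:
                cost += 1
            if cost > k:
                continue            # prune: this branch exceeds the bound
            left[t] -= 1
            found = search(done | {e}, t, cost)
            left[t] += 1
            if found:
                return True
        return False

    return search(frozenset(), None, 0)
```

The search keeps no state between calls and stops at the first trace that fits the bound, so graphs that are far above the bound are rejected without ever determining their exact preemption count.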

As an optimization, we use as slack (Line 5) the minimum between N − 2 and
the number of threads that have no deletable events; an event is not deletable if
it is in the porf-prefix of a write that backward revisited. Intuitively, the events
that are added to $\hat{G}$ to reach G (Prop. 1) are the events that will later be deleted
to eventually reach a graph that is a prefix of the final graph $G_f$.
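The test on Line 5, together with this optimization, amounts to a one-line comparison. The sketch below is illustrative only: the function name and parameters are hypothetical, the thread count used for the tighter slack is taken as an input rather than computed from a graph, and clamping the slack at zero for programs with fewer than two threads is an assumption.

```python
def exceeds_bound(preemptions, k, num_threads, threads_without_deletable=None):
    """Return True if an execution should be discarded by the Line 5 test.

    The default slack is N - 2 (clamped at 0, an assumption for N < 2).
    If the number of threads without deletable events is supplied, the
    slack is the minimum of the two values, as in the optimization above.
    """
    slack = max(num_threads - 2, 0)
    if threads_without_deletable is not None:
        slack = min(slack, threads_without_deletable)
    return preemptions > k + slack
```

With three threads and k = 0, an execution with one preemption is kept under the default slack, while one with two preemptions is discarded.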
Table 1. Buggy benchmarks. An X indicates that an error was found; t/o denotes a timeout.

                                 k=0             k=1             k=2            GENMC
Benchmark                   Execs    Time   Execs    Time   Execs    Time   Execs    Time
account-bad                   3 X    0.01     3 X    0.01     3 X    0.01     3 X    0.01
bluetooth-driver-bad            1    0.01     3 X    0.02     7 X    0.02     8 X    0.01
circular-buffer-bad             2    0.07    13 X    0.49     1 X    0.03     1 X    0.03
din-phil-sat                  0 X    0.01     0 X    0.01     0 X    0.01     0 X    0.01
fsbench-bad                   0 X    0.93     0 X    0.93     0 X    0.94     0 X    1.01
lazy01-bad                    0 X    0.01     0 X    0.01     0 X    0.01     0 X    0.01
queue-bad                      20    1.91    56 X   27.47     2 X    0.18     2 X    0.19
reorder-20-bad                t/o     t/o     t/o     t/o     t/o     t/o    10 X    0.05
stack-bad                      11    0.44    10 X    0.35    10 X    0.35    10 X    0.37
token-ring-bad               12 X    0.02    12 X    0.02    12 X    0.02    12 X    0.02
twostage-100-bad              t/o     t/o     t/o     t/o     t/o     t/o     t/o     t/o
wronglock-bad                5914  164.46     2 X    0.02     2 X    0.02     2 X    0.02
lazy01-unsafe                 0 X    0.01     0 X    0.01     0 X    0.01     0 X    0.01
sigma-unsafe                  0 X    0.01     0 X    0.01     0 X    0.01     0 X    0.01
singleton-unsafe              5 X    0.01     5 X    0.01     5 X    0.01     5 X    0.01
stateful01-1-unsafe           0 X    0.01     0 X    0.01     0 X    0.01     0 X    0.01
triangular-2-unsafe             6    0.04      66    0.40     368    2.06  9069 X   29.44
stack-2-unsafe                  6    0.06     5 X    0.05     5 X    0.05     5 X    0.05
read-write-lock-2-unsafe       68    0.51    53 X    0.25   132 X    0.59   276 X    0.96
reorder-2                     417    0.14     6 X    0.01     2 X    0.01     2 X    0.01


6 Evaluation 


To evaluate BUSTER, we answer the following questions: 


§6.1 How many preemptions suffice to expose common concurrency bugs? Is 
BUSTER effective at finding such concurrency bugs? 

§6.2 How good is preemption bounding at pruning the search space? Up to what 
bound does BUSTER run faster than vanilla DPOR? 

§6.3 What is the overhead induced by the bound calculation? 

§6.4 What is the overhead induced by bound-blocked executions? 


To that end, we evaluate BUSTER against GENMC on a diverse set of bench- 
marks. Unfortunately, we cannot include the approach of Coons et al. [10] in our 
comparison because their implementation is not available. 

We can draw two major conclusions from our evaluation. First, most bugs do
manifest with a small number of preemptions (≤ 2), an observation that has been
made in the literature before [25, 27]. Second, even though the bound calculation
can be fairly expensive, for small bounds BUSTER outperforms GENMC
and can find bugs faster than GENMC.


Experimental Setup We conducted all experiments on a Dell PowerEdge M620 
blade system with two Intel Xeon E5-2667 v2 CPUs (8 cores @ 3.3 GHz) and
256GB of RAM. We used LLVM 11.0.1 for GENMC and BUSTER. All reported 
times are in seconds. We set a timeout limit of 30 minutes. 


6.1 Bound and Bug Manifestation 


To validate that most bugs require a small number of preemptions, we run 
BUSTER and GENMC on three sets of benchmarks: 
Table 2. Buggy CD benchmarks. An X indicates that the error was found; t/o denotes a timeout.

                            k=0                k=1                k=2               GENMC
Benchmark              Execs     Time     Execs     Time     Execs     Time     Execs     Time
dglm-queue-bug(6)       48 X     2.55     305 X   102.25     810 X   272.71       t/o      t/o
dglm-queue-bug(7)       54 X     3.94     404 X   209.22    1259 X   628.52       t/o      t/o
dglm-queue-bug(8)       60 X     5.88     517 X   393.02    1854 X  1320.58       t/o      t/o
ms-queue-bug(6)         84 X     7.71    1366 X   155.08    9906 X  1057.28       t/o      t/o
ms-queue-bug(7)        103 X    12.87    1936 X   294.76       t/o      t/o       t/o      t/o
ms-queue-bug(8)        124 X    20.72    2636 X   530.04       t/o      t/o       t/o      t/o
bstack(7)                  2     0.24      19 X     1.26      83 X     3.55       t/o      t/o
bstack(8)                  2     0.34      22 X     2.06     111 X     6.41       t/o      t/o
bstack(9)                  2     0.48      25 X     3.23     143 X    10.95       t/o      t/o
msq-bug2(5)                2     0.09      18 X     0.48     154 X     2.69   37420 X   280.64
msq-bug2(6)                2     0.12      22 X     0.87     232 X     6.29       t/o      t/o
stack-oe-bug(4)           77     0.64      1086    17.77     375 X     9.66    3523 X    97.65
stack-oe-bug(5)           92     1.04      1700    38.25     663 X    23.61   17032 X   763.96
stack-oe-bug(6)          107     1.58      2478    74.83    1076 X    50.38       t/o      t/o
stack-oe-bug(7)          122     2.32      3435   134.89    1638 X    97.52       t/o      t/o


— the unsafe concurrent benchmarks of the SCT suite [27], 

— the unsafe benchmarks of the pthread category of SV-COMP [26] included 
in GENMC’s test suite, and 

— a set of concurrent data structures (CDs) from GENMC’s test suite with 
randomly induced bugs. 


In all cases, we configure BUSTER to disregard any errors that occur in executions 
that exceed the bound and are explored due to the slack. We note that this 
configuration may delay bug finding, since BUSTER may by chance quickly come 
across a buggy execution with more than k preemptions (due to slack) before
finding any buggy execution with up to k preemptions. Nevertheless, we follow it 
to ensure that the bugs found arise in executions with up to the desired number 
of preemptions, so as to be able to validate the claim that bugs manifest in 
executions with a small number of preemptions. 

Table 1 reports our results on the first two classes of benchmarks. As can
be seen, BUSTER was able to find most bugs using a bound of 1. In fact, for
most benchmarks, BUSTER found the bug before exploring a complete execution,
hence the “0 X” entries in the table. The only benchmarks where BUSTER needs
a bound greater than 1 are the synthetic benchmark triangular, which needs a
bound of 8 because it was specifically designed to make bug discovery difficult
and push model checkers to their limits, and reorder-20 and twostage-100, which
have a large number of threads (20 and 100, respectively). BUSTER times out on
the latter two benchmarks because the large number of threads puts a lot of
stress on the bound-checking procedure. We note that for twostage-100, GENMC
also fails to terminate within the time limit.

Table 2 reports our results for our CD benchmarks. For these benchmarks,
we have taken CD implementations from the GENMC test suite, and induced
bugs into them by randomly dropping a synchronization instruction or replacing
a CAS instruction with a normal write or an unconditional exchange instruction,
thereby introducing a possible atomicity violation. We then construct
medium-sized clients (with 2-3 threads and up to 12 operations per thread) of
these data structures that check for their intended semantics (for example, that
a queue has FIFO semantics). In all cases, the induced bugs lead to violations of
the assertions in the client programs, and occasionally even to memory errors.
BUSTER can find these bugs easily; a bound of k = 2 suffices to expose them. By
contrast, GENMC times out for most of these benchmarks, as their state space is
enormous.


Table 3. BUSTER and GENMC comparison on safe data structure benchmarks.

                     k=0            k=1            k=2            k=3           GENMC      Max
Benchmark        Execs    Time  Execs    Time  Execs    Time  Execs    Time  Execs    Time    k
dglm-queue(6)        ?    0.61      2    3.05     62   11.30    162   27.14    924  104.47    7
dglm-queue(7)        ?    0.97      4    5.78     86   25.65    266   71.73   3432  570.68    8
ms-queue(6)          ?    0.30      8    2.23    128    8.46    513   29.46  18564  321.58    8
ms-queue(7)          2    0.46     21    4.16    177   18.53    840   78.13    T/O     T/O    -
bstack2(8)           2    0.12      ?    0.58    114    2.97    408    9.17  12870  159.27    9
bstack2(9)           2    0.15      8    0.88    146    5.08    594   17.75  48620  720.06    8
bstack(5)            2    0.12     20    0.53     92    2.98    310    7.87   4214   88.01    8
bstack(6)            2    0.18     24    0.97    134    6.84    549   21.35  26040  787.64    8
ms-queue(7)          2    0.19      4    1.19     86    5.77    266   16.41   3432  135.85    7
ms-queue(8)          2    0.26      6    1.85    114   10.29    408   33.78  12870  641.64    8
stack-oe(4)         77    0.64   1098   17.62   6208  139.81  23472  641.13    T/O     T/O    -
stack-oe(5)         92    1.06   1713   39.55  11510  377.50    T/O     T/O    T/O     T/O    -
ms-oe(6)            12    0.27     84    2.93    615   18.82   2039   57.58  10880  218.86    5
ms-oe(7)            14    0.34    100    3.97    800   27.42   2855   91.54  20823  458.09    5
dglm-oe(7)           5    0.20     29    2.14    129    9.27    238   19.53    248   20.88    3
dglm-oe(8)           5    0.23     31    2.62    146   11.77    294   26.33    306   28.50    3
dglm-fifo(7)        26    4.50    128   21.84    128   25.93    128   25.12    128   22.92    1
dglm-fifo(8)        29    6.81    162   35.43    162   42.66    162   41.59    162   37.91    1
ttas-lock2(7)        2    0.12     14    0.48     86    1.89    266    4.57   3432   28.50    7
ttas-lock2(8)        2    0.17     16    0.81    114    3.66    408   10.14  12870  121.94    8
ttas-lock3(4)       21    0.89    195    7.12   1041   29.94   3525   84.55  34650  387.36    5
ttas-lock3(5)       26    2.32    320   23.97   2274  130.62  10494  492.89    T/O     T/O    -


6.2 Comparison with Plain DPOR on Safe Benchmarks 


We have already seen that, modulo specially crafted synthetic benchmarks, a small
preemption bound is sufficient for finding bugs in practice. Moreover, BUSTER is
effective at finding such bugs in concurrent data structures. We now evaluate
the application of BUSTER on a collection of safe benchmarks. For this purpose, 
we use different variations of the benchmarks of Table 2 (after repairing them so 
that no assertion is violated), as well as a few locking benchmarks. 

Table 3 compares the performance of BUSTER for small values of k with that of
GENMC. As can be seen, GENMC struggles with these benchmarks, whereas
BUSTER with k = 2 (and often also with k = 3) terminates fairly quickly. This is
because only a small fraction of the total executions of sizeable benchmarks have
few preemptions. Restricting the search to only those executions therefore makes
BUSTER run much faster than GENMC, while still guaranteeing that the program
under consideration has no bug that manifests within the chosen preemption bound.

In the last column of Table 3 we include the maximum value of k such that
BUSTER terminates faster than GENMC, for the benchmarks that terminate
under GENMC. In most cases BUSTER is faster than GENMC even for k > 3.
For the dglm-fifo benchmarks BUSTER is only faster for k ∈ {0, 1}, because for
these benchmarks a small k suffices to fully explore the state space.
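
The “Max k” column can be recomputed from the timing columns. The sketch below is illustrative only: the helper name max_faster_k is hypothetical, and it uses just the k = 0..3 timings listed in Table 3, so it cannot reproduce entries above 3. It returns the largest k whose BUSTER time is below GENMC's time.

```python
from typing import List, Optional


def max_faster_k(buster_times: List[Optional[float]], genmc_time: float) -> Optional[int]:
    """Largest k (index into buster_times) at which BUSTER beats GENMC.

    A None entry stands for a timeout. Returns None if BUSTER is never faster.
    """
    best = None
    for k, time in enumerate(buster_times):
        if time is not None and time < genmc_time:
            best = k
    return best


# Timings for k = 0, 1, 2, 3 and GENMC, copied from Table 3.
print(max_faster_k([4.50, 21.84, 25.93, 25.12], 22.92))  # dglm-fifo(7)
print(max_faster_k([6.81, 35.43, 42.66, 41.59], 37.91))  # dglm-fifo(8)
print(max_faster_k([0.23, 2.62, 11.77, 26.33], 28.50))   # dglm-oe(8)
```

For both dglm-fifo rows the result is 1, matching the k ∈ {0, 1} observation above.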


6.3 Bound Calculation Overhead 


We now measure the cost of checking that each encountered execution is below
the specified bound. As we discussed in §5, checking whether an execution graph’s
preemption-bound exceeds a given value is an NP-complete problem, and thus we
expect this calculation to threaten the performance of our tool.

To carefully account for this cost, we compare BUSTER against the baseline 
GENMC implementation on benchmarks where preemption bounding does not 
reduce the number of executions that are explored. In Table 4 (left), we report results
on simple CD clients that have only one operation per thread of the Treiber 
stack [28] and the TTAS lock [13]. The clients are designed so that BUSTER can 
explore the full set of program executions with a small bound k. We suffix the 
name of the benchmarks with the number of writer and reader threads for the 
Treiber stack and the total number of threads for TTAS. 

Column b contains the minimal value of the bound k for which BUSTER
explores the same number of executions as GENMC does. Note that since these
benchmarks contain several threads, exploration up to a certain bound (e.g.,
k = 0) does not mean that only executions with at most k preemptions are visited;
due to slack, executions with more preemptions may be visited, and so it is
possible for the exploration to cover the entire state space for a smaller bound
than intrinsically necessary. In the subsequent columns we report the time
overhead (as a percentage) for bounds k = b, k = b + 1, and k = b + 2 w.r.t.
GENMC’s execution time, which is shown in the last column. The maximum overhead
is observed for k = b (the minimal value sufficient to cover the entire state
space). This is expected because k = b places the most burden on the calculation
of whether the number of preemptions in a given execution is below k. For larger
values of k, the overhead drops because it is easier to show that the number of
preemptions is below the bound; one does not have to calculate the number of
preemptions of an execution precisely. Overall, for the Treiber stack benchmark,
the overhead introduced by calculating the bounds is fairly low and does not
exceed 23% of the execution time of GENMC. For the plain runs of ttas-lock,
the maximal overhead is a bit larger, up to 38%. We note, however, that such
overhead only occurs in clients with a large number of threads (7); smaller
clients are not affected as much.
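
As a worked reading of these percentages, the sketch below assumes that the reported overhead is the relative slowdown (t_BUSTER - t_GENMC) / t_GENMC. The text does not spell out the formula, so this is an assumption, and both helper names are hypothetical.

```python
def overhead_percent(buster_time: float, genmc_time: float) -> float:
    """Relative slowdown of BUSTER w.r.t. GENMC, as a percentage (assumed definition)."""
    return 100.0 * (buster_time - genmc_time) / genmc_time


def implied_buster_time(genmc_time: float, overhead_pct: float) -> float:
    """BUSTER running time implied by a GENMC time and an overhead percentage."""
    return genmc_time * (1.0 + overhead_pct / 100.0)


# treiber(7,0): GENMC time 529.42 and 23% overhead at k = b (values from Table 4).
implied = implied_buster_time(529.42, 23)
print(round(implied, 2))
print(round(overhead_percent(implied, 529.42), 1))
```

Under this reading, the 23% entry for treiber(7,0) corresponds to roughly 651 time units for BUSTER against 529.42 for GENMC.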


6.4 Overhead due to Bound-Blocked Executions 


Finally, we measure the overhead caused by bound-blocked executions, by
evaluating how often they arise in practice. Specifically, we ran BUSTER on
GENMC’s test suite for various preemption-bound values, as well as on the safe
CD clients used in §6.2, and counted the number of such bound-blocked executions.

Table 4. Overhead w.r.t. GENMC (left) and blocking in benchmarks (right).

Benchmark      b   k=b  k=b+1  k=b+2   GENMC      # Blocked  # Benchmarks
treiber(6,0)   0   10%     6%     4%   30.81              0            72
treiber(7,0)   0   23%    12%     5%  529.42              1           143
treiber(3,2)   1    6%     5%     5%    2.75              2            45
treiber(3,3)   1    7%     6%     5%   31.15              3             3
treiber(3,4)   1   13%     8%     6%  332.76              4            14
treiber(4,2)   1    9%     7%     5%   47.50            ...           ...
treiber(4,3)   2   10%     7%     5%  777.44             >8             6
ttas-lock(6)   0   20%    13%    11%   14.52
ttas-lock(7)   0   38%    25%    16%  231.91


For GENMC’s test suite, the results are summarized in Table 4 (right). We
have restricted our attention to the runs with at least 10 executions, so that
our results are not skewed by benchmarks that have very few executions. We
have also excluded 8 benchmarks from the test suite that use barriers because
they are currently not supported by our tool. As can be seen, bound-blocked
executions are rare: most runs lead to one bound-blocked execution, and only 6
lead to more than 8 bound-blocked executions. Bound-blocked executions are on
average no more than 6% of the total number of executions explored.

For the CD clients, bound-blocked executions are even rarer: out of the
22 clients, BUSTER encounters bound-blocked executions in only 4 of them, for
some k. We again exclude from the discussion runs with very few executions. Of
the remaining runs, only two encounter a considerable number of bound-blocked
executions, which becomes negligible as the bound is increased: around 10% for
k = 1 and less than 1% for k = 2.


7 Related Work 


There is a large body of work that has improved the original DPOR algorithm
of Flanagan and Godefroid [11]. Abdulla et al. [2] introduced the first optimal DPOR
algorithm, which, however, suffers from possibly exponential memory consumption. 
Kokologiannakis et al. [16] developed TruSt, which is the first optimal DPOR 
algorithm that consumes polynomial memory. 

Agarwal et al. [6], Chalupa et al. [8], Chatterjee et al. [9], and Huang [14]
have extended DPOR for partitions coarser than the one we have focused on in this
paper, i.e., Mazurkiewicz traces. Abdulla et al. [1, 4, 5] consider DPOR under
various weak memory models, while the works of Kokologiannakis et al. [16, 17, 
19] provide a DPOR algorithm that is parametric in the choice of the memory 
model, provided it respects some basic properties. 

Qadeer and Rehof [25] showed the decidability of context-bounded verification of
concurrent boolean programs. Musuvathi and Qadeer [24] propose iterative context
bounding, a search algorithm that prioritizes executions with fewer preemptions.
Musuvathi and Qadeer [23] combine partial-order reduction with a preemption-bounded
search, and prove that judging whether the preemption-bound of a Mazurkiewicz
trace exceeds a certain value is an NP-complete problem.

To our knowledge, the only attempt to combine DPOR and preemption
bounding is by Coons et al. [10], who identify the difficulty of maintaining 
completeness of the exploration, and resolve it by weakening DPOR. 

Abdulla et al. [3] and Atig et al. [7] have extended the notion of preemption 
bounding to weak memory models. We leave a possible extension of our approach 
to weak memory models for future work. 


Acknowledgments We thank the anonymous reviewers for their valuable feed- 
back. This work has received funding from the European Research Council (ERC) 
under the European Union’s Horizon 2020 research and innovation programme 
(grant agreement No. 101003349). 


8 Data-Availability Statement 


All supplementary material is available at [22]. The artifact is also available at 
[21]. 


References 


[1] Parosh Aziz Abdulla, Stavros Aronis, Mohamed Faouzi Atig, Bengt Jonsson,
Carl Leonardsson, and Konstantinos Sagonas. “Stateless model checking
for TSO and PSO”. In: TACAS 2015. Vol. 9035. LNCS. Berlin, Heidelberg:
Springer, 2015, pp. 353-367. DOI: 10.1007/978-3-662-46681-0_28. URL:
http://dx.doi.org/10.1007/978-3-662-46681-0_28.

[2] Parosh Aziz Abdulla, Stavros Aronis, Bengt Jonsson, and Konstantinos
Sagonas. “Optimal dynamic partial order reduction”. In: POPL 2014. New
York, NY, USA: ACM, 2014, pp. 373-384. DOI: 10.1145/2535838.2535845.
URL: http://doi.acm.org/10.1145/2535838.2535845.

[3] Parosh Aziz Abdulla, Mohamed Faouzi Atig, Ahmed Bouajjani, and Tuan
Phong Ngo. “Context-Bounded Analysis for POWER”. In: TACAS 2017.
Ed. by Axel Legay and Tiziana Margaria. Berlin, Heidelberg: Springer Berlin
Heidelberg, 2017, pp. 56-74. ISBN: 978-3-662-54580-5.
DOI: 10.1007/978-3-662-54580-5_4.

[4] Parosh Aziz Abdulla, Mohamed Faouzi Atig, Bengt Jonsson, and Carl
Leonardsson. “Stateless model checking for POWER”. In: CAV 2016.
Vol. 9780. LNCS. Berlin, Heidelberg: Springer, 2016, pp. 134-156.
DOI: 10.1007/978-3-319-41540-6_8.
URL: https://doi.org/10.1007/978-3-319-41540-6_8.

[5] Parosh Aziz Abdulla, Mohamed Faouzi Atig, Bengt Jonsson, and Tuan
Phong Ngo. “Optimal stateless model checking under the release-acquire
semantics”. In: Proc. ACM Program. Lang. 2.OOPSLA (Oct. 2018),
135:1-135:29. ISSN: 2475-1421. DOI: 10.1145/3276505.
URL: http://doi.acm.org/10.1145/3276505.

[6] Pratyush Agarwal, Krishnendu Chatterjee, Shreya Pathak, Andreas
Pavlogiannis, and Viktor Toman. “Stateless Model Checking Under a
Reads-Value-From Equivalence”. In: CAV 2021. Ed. by Alexandra Silva and
K. Rustan M. Leino. Cham: Springer International Publishing, July 2021,
pp. 341-366. ISBN: 978-3-030-81685-8. DOI: 10.1007/978-3-030-81685-8_16.

[7] Mohamed Faouzi Atig, Ahmed Bouajjani, and Gennaro Parlato.
“Context-Bounded Analysis of TSO Systems”. In: FPS 2014. Ed. by Saddek
Bensalem, Yassine Lakhnech, and Axel Legay. Berlin, Heidelberg: Springer Berlin
Heidelberg, 2014, pp. 21-38. ISBN: 978-3-642-54848-2.
DOI: 10.1007/978-3-642-54848-2_2.

[8] Marek Chalupa, Krishnendu Chatterjee, Andreas Pavlogiannis, Nishant
Sinha, and Kapil Vaidya. “Data-centric dynamic partial order reduction”.
In: Proc. ACM Program. Lang. 2.POPL (Dec. 2017), 31:1-31:30. ISSN:
2475-1421. DOI: 10.1145/3158119.
URL: http://doi.acm.org/10.1145/3158119.

[9] Krishnendu Chatterjee, Andreas Pavlogiannis, and Viktor Toman.
“Value-Centric Dynamic Partial Order Reduction”. In: Proc. ACM Program. Lang.
3.OOPSLA (Oct. 2019). DOI: 10.1145/3360550.
URL: https://doi.org/10.1145/3360550.

[10] Katherine E. Coons, Madan Musuvathi, and Kathryn S. McKinley. “Bounded
Partial-Order Reduction”. In: OOPSLA 2013. Indianapolis, Indiana, USA:
ACM, 2013, pp. 833-848. ISBN: 9781450323741. DOI: 10.1145/2509136.2509556.
URL: https://doi.org/10.1145/2509136.2509556.

[11] Cormac Flanagan and Patrice Godefroid. “Dynamic partial-order reduction
for model checking software”. In: POPL 2005. New York, NY, USA: ACM,
2005, pp. 110-121. DOI: 10.1145/1040305.1040315.
URL: http://doi.acm.org/10.1145/1040305.1040315.

[12] Patrice Godefroid. “Model checking for programming languages using
VeriSoft”. In: POPL 1997. Paris, France: ACM, 1997, pp. 174-186.
DOI: 10.1145/263699.263717.
URL: http://doi.acm.org/10.1145/263699.263717.

[13] Maurice Herlihy and Nir Shavit. The art of multiprocessor programming.
2008.

[14] Jeff Huang. “Stateless model checking concurrent programs with maximal
causality reduction”. In: PLDI 2015. New York, NY, USA: ACM, 2015,
pp. 165-174. DOI: 10.1145/2737924.2737975.
URL: http://doi.acm.org/10.1145/2737924.2737975.

[15] Michalis Kokologiannakis, Ori Lahav, Konstantinos Sagonas, and Viktor
Vafeiadis. “Effective stateless model checking for C/C++ concurrency”.
In: Proc. ACM Program. Lang. 2.POPL (Dec. 2017), 17:1-17:32. ISSN:
2475-1421. DOI: 10.1145/3158105.
URL: http://doi.acm.org/10.1145/3158105.

[16] Michalis Kokologiannakis, Iason Marmanis, Vladimir Gladstein, and Viktor
Vafeiadis. “Truly stateless, optimal dynamic partial order reduction”. In:
Proc. ACM Program. Lang. 6.POPL (Jan. 2022). DOI: 10.1145/3498711.
URL: https://doi.org/10.1145/3498711.

[17] Michalis Kokologiannakis, Azalea Raad, and Viktor Vafeiadis. “Model
checking for weakly consistent libraries”. In: PLDI 2019. New York, NY,
USA: ACM, 2019. DOI: 10.1145/3314221.3314609.

[18] Michalis Kokologiannakis and Viktor Vafeiadis. “GenMC: A model checker
for weak memory models”. In: CAV 2021. Ed. by Alexandra Silva and
K. Rustan M. Leino. Vol. 12759. LNCS. Springer, 2021, pp. 427-440.
DOI: 10.1007/978-3-030-81685-8_20.

[19] Michalis Kokologiannakis and Viktor Vafeiadis. “HMC: Model checking for
hardware memory models”. In: ASPLOS 2020. ASPLOS ’20. Lausanne,
Switzerland: ACM, 2020, pp. 1157-1171. ISBN: 9781450371025.
DOI: 10.1145/3373376.3378480.
URL: https://doi.org/10.1145/3373376.3378480.

[20] Ori Lahav, Viktor Vafeiadis, Jeehoon Kang, Chung-Kil Hur, and Derek
Dreyer. “Repairing sequential consistency in C/C++11”. In: PLDI 2017.
Barcelona, Spain: ACM, 2017, pp. 618-632. ISBN: 978-1-4503-4988-8.
DOI: 10.1145/3062341.3062352.
URL: http://doi.acm.org/10.1145/3062341.3062352.

[21] Iason Marmanis, Michalis Kokologiannakis, and Viktor Vafeiadis.
“Reconciling Preemption Bounding with DPOR (artifact)”. In: (Apr. 2023).
DOI: 10.5281/zenodo.7505917.

[22] Iason Marmanis, Michalis Kokologiannakis, and Viktor Vafeiadis.
“Reconciling Preemption Bounding with DPOR (supplementary material)”. In:
(Apr. 2023). URL: https://plv.mpi-sws.org/genmc.

[23] Madanlal Musuvathi and Shaz Qadeer. Partial-Order Reduction for
Context-Bounded State Exploration. Tech. rep. MSR-TR-2007-12. Microsoft
Research, 2007.
URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2007-12.pdf.

[24] Madanlal Musuvathi and Shaz Qadeer. “Iterative Context Bounding for
Systematic Testing of Multithreaded Programs”. In: PLDI 2007. San Diego,
California, USA: ACM, 2007, pp. 446-455. ISBN: 9781595936332.
DOI: 10.1145/1250734.1250785.
URL: https://doi.org/10.1145/1250734.1250785.

[25] Shaz Qadeer and Jakob Rehof. “Context-Bounded Model Checking of
Concurrent Software”. In: TACAS 2005. Ed. by Nicolas Halbwachs and
Lenore D. Zuck. Vol. 3440. LNCS. Springer, 2005, pp. 93-107.
DOI: 10.1007/978-3-540-31980-1_7.
URL: https://doi.org/10.1007/978-3-540-31980-1_7.

[26] SV-COMP. Competition on Software Verification (SV-COMP). 2019. URL:
https://sv-comp.sosy-lab.org/2019/ (visited on 03/27/2019).

[27] Paul Thomson, Alastair F. Donaldson, and Adam Betts. “Concurrency
testing using schedule bounding: an empirical study”. In: PPoPP 2014.
ACM, 2014, pp. 15-28. DOI: 10.1145/2555243.2555260.
URL: https://doi.org/10.1145/2555243.2555260.

[28] R. Kent Treiber. Systems Programming: Coping with Parallelism. Tech. rep.
Technical Report RJ5118, IBM, 1986.
URL: https://dominoweb.draco.res.ibm.com/58319a2ed2b1078985257003004617ef.html.


Open Access This chapter is licensed under the terms of the Creative Commons 
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), 
which permits use, sharing, adaptation, distribution and reproduction in any medium 
or format, as long as you give appropriate credit to the original author(s) and the 
source, provide a link to the Creative Commons license and indicate if changes were 
made. 

The images or other third party material in this chapter are included in the 
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the 
material. If material is not included in the chapter’s Creative Commons license and 
your intended use is not permitted by statutory regulation or exceeds the permitted 
use, you will need to obtain permission directly from the copyright holder. 




Optimal Stateless Model Checking for Causal 
Consistency 


Parosh Abdulla¹, Mohamed Faouzi Atig¹, S. Krishna²,
Ashutosh Gupta², and Omkar Tuppe²


¹ Uppsala University, Uppsala, Sweden
{parosh,mohamed_faouzi.atig}@it.uu.se
² IIT Bombay, Mumbai, India
{krishnas,akg,omkarvtuppe}@cse.iitb.ac.in


Abstract. We present a framework for efficient stateless model checking
(SMC) of concurrent programs under three prominent models of causal
consistency: CCv, CM, and CC. Our approach is based on exploring traces under
the program order po and the reads-from rf relations. Our SMC algorithm
is provably optimal in the sense that it explores each po and rf relation
exactly once. We have implemented our framework in a tool called
CONSCHECKER. Experiments show that CONSCHECKER performs well in
detecting anomalies in classical distributed database benchmarks.


1 Introduction 


Traditionally, distributed shared memories ensure that all processes in the sys- 
tem agree on a common order of all operations on memory. Such guarantees are 
provided by sequential consistency (SC) [33], and by linearizable memory [26]. 
However, providing these consistency guarantees entails access latencies, making 
them inefficient for large systems. There is a tradeoff in providing strong con- 
sistency guarantees while ensuring low latency and this presents significant effi- 
ciency challenges. There is a large body of work which suggests that a systematic 
weakening of memory consistency can reduce the costs of providing consistency. 
Weakened consistency guarantees admit more concurrent behaviours than SC 
or linearizability. To this end, Lamport [32] proposed causal consistency which 
provides an ordering among events in a distributed system in which processes 
communicate via message passing. This has been adapted [7] to a setting of
reads and writes in a shared memory environment. In this setting, the return 
values of reads must be consistent with causally related reads and writes. As 
causality only orders events partially, the reading processes can disagree on the 
relative ordering of concurrent writes. This makes concurrent writer processes 
independent, reducing the costs of synchronization. 

Several efforts have been made to formalize causal consistency [16], [25], [39],
[40], [7], [15], [10], [8], [38] and there are many implementations [9], [20], [21]
satisfying this criterion as opposed to strong consistency (linearizability).
While strong consistency models are easier to program against than weak ones,
they require costly implementations. Weak memory models may be easier to
implement, but are much harder to program. An acceptable middle ground that has
emerged over the years consists of three important notions of causal consistency:
causal consistency (CC) [15], [25], causal convergence (CCv) [16], [39], [15],
[25], and causal memory (CM) [7], [39], [15], [25].


The focus of this paper is the verification of shared memory programs under 
causal consistency. We consider the three variants mentioned above. We pro- 
pose a stateless model checking (SMC) framework that covers all three variants. 
SMC is a successful technique for finding concurrency bugs [23]. For a termi- 
nating program, SMC systematically explores all process schedulings that are 
possible during runs of the program. The number of possible schedulings grows 
exponentially with the execution length in SMC. To counter this and reduce the 
number of explored executions, the technique of partial order reduction [18,22] 
has been proposed. This has been adapted to SMC as DPOR (dynamic partial 
order reduction). DPOR was first developed for concurrent programs under SC 
[1,41]. Recent years have seen DPOR adapted to language-induced weak memory
models [28,37], [5], as well as hardware-induced relaxed memory models [3,46].
To the best of our knowledge, DPOR algorithms have not been developed for 
causal consistency models. The goal of this paper is to fill this gap. 

DPOR is based on the observation that two executions are equivalent if they
induce the same ordering between conflicting events, and hence it is sufficient to
consider one such execution from each equivalence class. Under sequential
consistency, these equivalence classes are called Mazurkiewicz traces [34], while
for relaxed memory models, their generalization is called Shasha-Snir traces
[42]. A Shasha-Snir trace characterizes an execution of a concurrent program by
the relations (1) po (program order), which totally orders the events of each
process, (2) rf (reads-from), which connects each read with the write it reads
from, and (3) co (coherence order), which totally orders the writes to the same
shared variable. DPOR can be optimized further by observing that the assertions
to be verified at the end of an execution do not depend on the coherence order of
shared variables, and hence it suffices to consider traces over po-rf. Based on
this observation, the DPOR algorithms for programs under the release-acquire
semantics (RA) and SC [5], [4] explore traces with po, rf and co, where the co
edges are added on the fly. The equivalence classes are considered wrt po-rf,
reducing the number of distinct traces to be analyzed.
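
The effect of dropping co can be seen on a tiny example. The sketch below is an illustration under sequential consistency, not the algorithm of [5], [4]: it takes a program with two writer threads and one reader thread on a variable x, computes rf and co for a given interleaving (a read takes the latest preceding write, or the initializer if there is none), and compares interleavings. Since po is fixed by the program, two interleavings with the same rf fall into the same po-rf class even if their co differs. All names are hypothetical, and each thread is assumed to perform a single event so that a tuple identifies an event.

```python
from typing import Dict, List, Tuple

# (thread, kind, variable), with kind "wt" or "rd"; one event per thread.
Event = Tuple[str, str, str]


def reads_from(interleaving: List[Event]) -> Dict[Event, Event]:
    """Map each read to the write it observes under sequential consistency."""
    last_write: Dict[str, Event] = {}
    rf: Dict[Event, Event] = {}
    for event in interleaving:
        _, kind, var = event
        if kind == "wt":
            last_write[var] = event
        else:
            # No earlier write: the read observes the initializer of var.
            rf[event] = last_write.get(var, ("init", "wt", var))
    return rf


def coherence(interleaving: List[Event]) -> Dict[str, List[Event]]:
    """Per-variable order in which the writes take effect."""
    co: Dict[str, List[Event]] = {}
    for event in interleaving:
        if event[1] == "wt":
            co.setdefault(event[2], []).append(event)
    return co


w1, w2, r = ("T1", "wt", "x"), ("T2", "wt", "x"), ("T3", "rd", "x")
run_a = [r, w1, w2]
run_b = [r, w2, w1]  # same rf as run_a, different co
run_c = [w1, r, w2]  # different rf

print(reads_from(run_a) == reads_from(run_b), coherence(run_a) == coherence(run_b))
print(reads_from(run_a) == reads_from(run_c))
```

Here run_a and run_b are distinct when co is tracked but coincide as po-rf traces, which is the reduction the paragraph above describes.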


Contributions. We propose a DPOR-based SMC algorithm for all three consistency
models CC, CCv, CM, which systematically explores all the distinct po-rf
traces covering all possible executions of the program. We develop a uniform
algorithm for all three models which is sound and complete: that is, all traces
explored are consistent wrt the model X ∈ {CC, CCv, CM} under consideration,
and all such consistent traces are explored. Moreover, our algorithm is optimal
in the sense that each consistent po-rf trace is explored exactly once. One of the
key challenges during the trace exploration is to maintain the consistency of the
traces wrt the model under consideration. We tackle this by defining a trace
semantics which ensures that the traces generated in each step only contain edges
which will be present in any consistent trace. We implement our algorithms in a
tool CONSCHECKER which is, to the best of our knowledge, the first of its kind
to perform SMC on the three prominent causal consistency models CC, CCv, CM.
CONSCHECKER checks for assertion violations of programs under CC, CCv, CM. We
evaluate the correctness of our tool on CC, CCv, CM by simulating these models
on the memory model simulator Herd [8] and validating our outcomes against
its outcomes. Then we proceed with an experimental evaluation on a wide range of
benchmarks from distributed databases. We show that (i) CONSCHECKER correctly
detects known consistency bugs [13], [14], [12] and [11] under CCv, CM, CC, and
(ii) CONSCHECKER correctly detects known assertion violations in applications
[19], [27], [12], [36]. We also stress-tested CONSCHECKER on some SV-COMP
benchmarks and parameterized benchmarks, which resulted in a large number (6
million) of traces.


Related Work. SMC has been implemented in many tools: CHESS [35], Concuerror
[17], VeriSoft [24], Nidhugg [3], CDSChecker [37], RCMC [28], GenMC [30],
rInspect [46] and Tracer [5]. While most of these work with either Mazurkiewicz
traces or po-rf traces, [6] proposes an RVF-SMC algorithm where the value read
is used to decide equivalence of two runs.

In recent years, there has been much interest in DPOR algorithms: [4] for
SC, [30] for the release acquire semantics, [43] for C/C++, and [29] for TSO, 
PSO and RC11. It is known that CC is weaker than RA, CCv is stronger than 
RA while CM is incomparable with RA [31]. In conclusion, all the above memory 
models are different from CC, CCv, CM. Hence we cannot reuse any of the existing 
DPOR algorithms. 

Recent work on causal consistency [15] studies the complexity of checking
whether one execution (all executions) of a program under CC, CCv, CM is
consistent. They show that checking if an execution is consistent is NP-complete,
while the question of checking if all executions are consistent is undecidable.
[11], [12] explore the robustness wrt SC of transactional programs under CC, CCv,
CM. However, none of these papers propose a DPOR algorithm for CC, CCv, CM.


2 Preliminaries 


Programs. We consider a program P consisting of a finite set T of threads
(processes) that share a finite set X of (shared) variables, ranging over a domain V
of values that includes a special value 0.

A process has a finite set of local registers that store values from V. Each
process runs deterministic code, built in a standard way from expressions and
atomic commands, using standard control flow constructs (sequential composition,
selection, and bounded loop constructs). Throughout the paper, we use x, y
for shared variables, a, b, c for registers, and e for expressions. Global statements
are either writes x := e to a shared variable, or reads a := x from a shared
variable. Local statements only access and affect the local state of the process and
include assignments a := e to registers, and conditional control flow constructs.
Note that expressions do not contain shared variables, implying that a statement
accesses at most one shared variable.

The local state of a process $\mathit{proc} \in T$ is defined by its program counter and the
contents of its registers. A configuration of P is made up of the local states of all
the processes. The values of the shared variables are not part of a configuration.
A program execution is a sequence of transitions between configurations, starting
with the initial configuration $\gamma^{\mathit{init}}$. Each transition corresponds to one process
performing a local or global statement. A transition between two configurations
$\gamma$ and $\gamma'$ is of the form $\gamma \xrightarrow{\ell} \gamma'$, where the label $\ell$ describes the interaction with
shared variables. The label $\ell$ has one of three forms: (i) $\langle \mathit{proc}, \varepsilon \rangle$, indicating a
local statement performed by thread proc, which updates only the local state
of proc, (ii) $\langle \mathit{proc}, \mathsf{wt}, x, v \rangle$, indicating a write of the value v to the variable x
by the thread proc, which also updates the program counter of proc, and (iii)
$\langle \mathit{proc}, \mathsf{rd}, x, v \rangle$, indicating a read of v from x by the thread proc into some register,
while also updating the program counter of proc. There is no constraint on the
values that are used in transitions corresponding to read statements. This
allows some illegal program behaviors, which are sorted out by associating runs with
so-called traces, which represent how reads obtain their values from writes. A causal
consistency model $X \in \{\mathsf{CC}, \mathsf{CCv}, \mathsf{CM}\}$ is formulated by imposing restrictions on
traces, thereby also restricting the possible runs that are associated with them.

Since local statements are not visible to other threads, we will not represent
them explicitly in the transition relation considered in our DPOR algorithm.
Instead, we let each transition represent the combined effect of some finite
sequence of local statements by a process followed by a global statement by the
same process. For configurations $\gamma$ and $\gamma'$ and a label $\ell$ which is either of the
form $\langle \mathit{proc}, \mathsf{wt}, x, v \rangle$ or of the form $\langle \mathit{proc}, \mathsf{rd}, x, v \rangle$, we let
$\gamma \stackrel{\ell}{\Longrightarrow} \gamma'$ denote that we can reach $\gamma'$ from $\gamma$ by performing a sequence of
transitions labeled with $\langle \mathit{proc}, \varepsilon \rangle$ followed by a transition labeled with $\ell$.
Defining the relation $\Longrightarrow$ in this manner ensures that we take the effect of local
statements into account, while avoiding consideration of interleavings of local
statements of different threads in the analysis.


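We use $\gamma \Longrightarrow \gamma'$ to denote that $\gamma \stackrel{\ell}{\Longrightarrow} \gamma'$ for some $\ell$ and define
$\mathit{succ}(\gamma) := \{\gamma' \mid \gamma \Longrightarrow \gamma'\}$, i.e., it is the set of successors of $\gamma$ wrt. $\Longrightarrow$.
A configuration $\gamma$ is said to be terminal if $\mathit{succ}(\gamma) = \emptyset$, i.e., no thread can
execute a global statement from $\gamma$. A run $\rho$ from $\gamma$ is a sequence
$\gamma_0 \stackrel{\ell_1}{\Longrightarrow} \gamma_1 \stackrel{\ell_2}{\Longrightarrow} \cdots \stackrel{\ell_n}{\Longrightarrow} \gamma_n$ such that $\gamma_0 = \gamma$.
We say that $\rho$ is terminated if $\gamma_n$ is terminal. We let $\mathit{Runs}(\gamma)$ denote the set of
runs from $\gamma$.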

Events. An event corresponds to a particular execution of a statement in a run of
P. A write event ev is given by $\langle \mathit{id}, \mathit{proc}, \mathsf{wt}(x, v) \rangle$ where $\mathit{id} \in \mathbb{N}$ is the identifier
of the event, proc is the process containing the event, $x \in X$ is a variable, and
$v \in V$ is a value. This event corresponds to a process writing the value v to
variable x. Likewise, a read event ev is given by $\langle \mathit{id}, \mathit{proc}, \mathsf{rd}(x) \rangle$ where $x \in X$.
This event corresponds to a process reading some value from x. The read event
ev does not specify the particular value it reads; this value will be defined in
a trace by specifying a write event from which ev fetches its value. For each
variable $x \in X$, we assume a special write event $\mathit{init}_x = \mathsf{wt}(x, 0)$ called the
initializer event for x. This event is not performed by any of the processes in
T, and writes the value 0 to x. We define $E^{\mathit{init}} := \{\mathit{init}_x \mid x \in X\}$ as the set of
initializer events. If E is a set of events, we define subsets of E characterized by
particular attributes of its events. For instance, for a variable x, we let $E^{\mathsf{wt},x}$
denote $\{ev \in E \mid ev.\mathit{type} = \mathsf{wt} \wedge ev.\mathit{var} = x\}$.

Traces. A trace $\tau$ is a tuple $(E, po, rf)$, where E is a set of events which
includes the set $E_{init}$ of initializer events, and po (program order) and rf
(read-from) are binary relations on E that satisfy:

• ev po ev' if process(ev) = process(ev') and ev.id < ev'.id. po totally orders
the events of each individual process.

• ev rf ev' if ev is a write event and ev' is a read event on the same variable,
which obtains its value from ev.

We can view $\tau = (E, po, rf)$ as a graph whose nodes are E and whose edges
are defined by the relations po and rf. In the figures, po is depicted by solid
red edges and captures the order within each process, while rf is depicted by
solid blue edges. We define the empty trace
$\tau_\emptyset := (E_{init}, \emptyset, \emptyset)$, i.e., it contains only the
initializer events, and all the relations are empty.
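
The trace structure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Event`, `Trace` and `empty_trace`, the `"init"` process tag and the choice of id 0 for initializers are hypothetical and are not taken from the paper or its tool.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class Event:
    id: int                    # per-process index (0 is used here for initializers)
    proc: str                  # owning process; "init" is an assumed tag
    type: str                  # "wt" or "rd"
    var: str
    val: Optional[int] = None  # written value; reads carry no value

@dataclass
class Trace:
    events: set = field(default_factory=set)
    rf: set = field(default_factory=set)   # pairs (write, read)

    def po(self):
        # ev po ev' iff both belong to the same process and ev.id < ev'.id
        return {(a, b) for a in self.events for b in self.events
                if a.proc == b.proc != "init" and a.id < b.id}

def empty_trace(variables):
    # the empty trace holds only the initializer writes init_x = wt(x, 0)
    return Trace(events={Event(0, "init", "wt", x, 0) for x in variables})

# a process pa writes 1 to x and then reads it back
t = empty_trace({"x", "y"})
w = Event(1, "pa", "wt", "x", 1)
r = Event(2, "pa", "rd", "x")
t.events |= {w, r}
t.rf.add((w, r))
```

Deriving po from the event identifiers, rather than storing it, mirrors the definition of po given above.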

We define when a trace can be associated with a run. Consider a run $\rho$ of
the form
$\gamma_0 \xrightarrow{\ell_1} \gamma_1 \xrightarrow{\ell_2} \cdots \xrightarrow{\ell_n} \gamma_n$,
where $\ell_i = (proc_i, t_i, x_i, v_i)$, and let $\tau = (E, po, rf)$ be a
trace. We write $\rho \models \tau$ to denote that the following conditions are
satisfied: (i) $E = \{ev_1, \ldots, ev_n\}$, i.e., each event corresponds to
exactly one label in $\rho$. (ii) If $\ell_i = (proc_i, wt, x_i, v_i)$, then
$ev_i = (id_i, proc_i, wt(x_i, v_i))$, and if $\ell_i = (proc_i, rd, x_i, v_i)$,
then $ev_i = (id_i, proc_i, rd(x_i))$. An event and its label perform the same
operation (write or read) on identical variables, and for writes, they also
agree on the written value. (iii)
$id_i = |\{j \mid (1 \le j \le i) \wedge (proc_j = proc_i)\}|$, i.e., ev.id
shows how ev is ordered relative to the other events of process(ev). (iv) If
$ev_i$ rf $ev_j$ then $x_i = x_j$ and $v_i = v_j$. (v) If $init_x$ rf $ev_i$
then $v_i = 0$, i.e., $ev_i$ reads the initial value of x, which is 0.
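
Conditions (ii) and (iii) can be illustrated by a small Python sketch that turns the labels of a run into events. The tuple encoding and the function name `events_of_run` are assumptions made for this example only.

```python
from collections import Counter

def events_of_run(labels):
    """labels: list of (proc, type, var, val) tuples, in run order."""
    seen = Counter()
    events = []
    for proc, typ, var, val in labels:
        seen[proc] += 1   # id_i = number of labels of proc up to and including position i
        # condition (ii): writes record their value, reads do not
        payload = (typ, var, val) if typ == "wt" else (typ, var)
        events.append((seen[proc], proc) + payload)
    return events

run = [("pa", "wt", "x", 1), ("pb", "rd", "x", 1), ("pa", "wt", "y", 1)]
events = events_of_run(run)
```

The value read by `pb` is deliberately dropped from its event; in a trace it is determined by the rf relation, as required by condition (iv).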


3  Causally Consistent Models 


We study three variants [15] of causal consistency: CC, CCv and CM. To define
the three models formally, we introduce a function that, for each model, extends
a given trace uniquely by a set of new edges. Then we define the model by
requiring that the extended trace does not contain any cycles. A run of the
program satisfies a consistency model X if its associated extended trace has no
cycles.

Let CO, called the causality order, denote $(po \cup rf)^+$. Two events
$e_1, e_2$ are causally related if either $e_1$ CO $e_2$ or $e_2$ CO $e_1$.


Causal Consistency CC. We start by presenting the weakest notion of causal
consistency, CC [25], [7]. First we give an intuitive description of CC. In CC,
events which are not causally related can be executed in different orders in
different processes; moreover, decisions made about these orders can be revised
by each process. To illustrate, consider the program in Fig. 1(b). The write
events wt(x, 1) and wt(x, 2) are not causally related and hence can be ordered
in any way. Note that $p_b$ first orders x := 1 after x := 2 and reads 1 into a;
it then revises this order, and orders x := 2 after x := 1 and reads 2 into b.


[Figure 1: example programs over processes $p_a$ and $p_b$; panels (a) not CC,
CM, CCv; (b) CC not CM, CCv; (c) CCv not CM; (d) CM not CCv]

Fig. 1. Programs showing the differences between consistency models. The
$\triangleright v$ denotes the expected return value of the read event.


[Figure 2: traces with cycles; panels 1(a): CCcycle, 1(b): CCvcycle,
1(b): CMcycle, 1(c): CMcycle, 1(d): CCvcycle]

Fig. 2. Solid red and blue edges denote po and rf, respectively; wt(x, v) and
rd(x) denote write and read events.

A trace $\tau$ does not violate CC as long as there is a causality order which
explains the return value of each read event.

To capture traces violating CC, we define a relation OW (for overwrite) on
writes to the same variable. For any two writes $w_1, w_2$ and a read r on the
same variable, if $w_1$ CO $w_2$ CO r, and $w_1$ rf r, then $w_2$ OW $w_1$. This
says that r reads the overwritten write $w_1$, resulting in a $CO \cup OW$
cycle. We refer to $CO \cup OW$ cycles as CCcycle. We define a function
$extend_{CC}(\tau)$ which extends a trace $\tau = (E, po, rf)$ by adding all
possible OW edges between write events on the same variable. For a trace
$\tau = (E, po, rf)$, we say that $\tau \models CC$ iff $extend_{CC}(\tau)$
does not have a CCcycle.

Examples. The program in Fig. 1(a) is not CC since there is no causality order
which explains the return values of the read events. If we consider any trace
(Fig. 2) of the program in Fig. 1(a), we find that wt(y, 1) rf r where r = rd(y),
wt(x, 1) po wt(y, 1), and r po wt(x, 2). Then we get wt(x, 1) CO wt(x, 2) and
wt(x, 2) CO r' where r' = rd(x) and wt(x, 1) rf r', giving wt(x, 2) OW wt(x, 1)
and witnessing a CCcycle.
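
The CC check can be sketched in Python as follows. This is a naive illustration of the definitions, not the procedure used in the tool: event names such as `wx1` are hypothetical labels for the events of Fig. 1(a) as described in the example above, and initializer events are omitted for brevity.

```python
from itertools import product

def closure(edges):
    """Transitive closure of a set of (a, b) pairs."""
    reach = set(edges)
    while True:
        new = {(a, d) for (a, b), (c, d) in product(reach, reach) if b == c} - reach
        if not new:
            return reach
        reach |= new

def has_cc_cycle(writes, po, rf):
    """writes maps each write event to its variable; po and rf are sets of pairs."""
    co = closure(po | rf)
    # w2 OW w1 whenever w1 CO w2 CO r and w1 rf r, with w1 and w2 on the same variable
    ow = {(w2, w1) for (w1, r) in rf for w2 in writes
          if w2 != w1 and writes[w2] == writes[w1]
          and (w1, w2) in co and (w2, r) in co}
    return any(a == b for a, b in closure(co | ow))

# Fig. 1(a): pa performs wx1, wy1; pb performs ry, wx2, rx
writes = {"wx1": "x", "wy1": "y", "wx2": "x"}
po = {("wx1", "wy1"), ("ry", "wx2"), ("wx2", "rx")}
violating = has_cc_cycle(writes, po, {("wy1", "ry"), ("wx1", "rx")})  # rx reads the stale write
consistent = has_cc_cycle(writes, po, {("wy1", "ry"), ("wx2", "rx")})  # rx reads the latest write
```

With rx reading from `wx1`, the sketch finds the edge `wx2` OW `wx1` and hence a cycle; letting rx read from `wx2` instead yields no OW edge.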


Causal Convergence CCv. Under CCv, we need a total order on all write events
per variable. This order, called the arbitration order, is an abstraction of how
conflicts are resolved by all processes to agree upon one ordering among events
which are not causally related. Thus, unlike CC, a process cannot revise its
ordering of the events which are not causally related, and all processes must
follow one ordering. This makes it stronger than CC.

To enforce a total order between all writes, we use a new relation CF, called
the conflict relation, on all write events per variable. For all variables
$x \in X$, writes $w_1, w_2$ on x and a read r = rd(x), if $w_1$ CO r and
$w_2$ rf r then $w_1$ CF $w_2$. We define a function $extend_{CCv}(\tau)$ which
extends a trace $\tau = (E, po, rf)$ by adding all possible OW and CF edges
between write events on the same variable. Traces violating CCv exhibit a
$CO \cup CF \cup OW$ cycle in $extend_{CCv}(\tau)$, which we refer to as
CCvcycle. We say that $\tau \models CCv$ iff $extend_{CCv}(\tau)$ does not
contain a CCvcycle.


Examples. For the program in Fig. 1(b) and any trace $\tau$,
$extend_{CCv}(\tau)$ has a CCvcycle (see Fig. 2), since in any trace we have
$w_1 = wt(x, 1)$ CO $r_2$ where $r_2 = rd(x)$, and $w_2$ rf $r_2$ for
$w_2 = wt(x, 2)$, giving $w_1$ CF $w_2$. We also have wt(x, 2) CO $r_1$ where
$r_1 = rd(x)$ with $w_1$ rf $r_1$, giving $w_2$ CF $w_1$. Intuitively, we cannot
find a total order amongst the writes to justify the reads of 1 and 2.
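
The following Python sketch replays this example. It is an illustration under assumptions: the event names `w1`, `w2`, `r1`, `r2` follow the example text, initializers are omitted, and only CF edges are computed, since under the definitions above every OW edge is also a CF edge ($w_1$ CO $w_2$ CO r with $w_1$ rf r gives $w_2$ CO r and $w_1$ rf r).

```python
from itertools import product

def closure(edges):
    """Transitive closure of a set of (a, b) pairs."""
    reach = set(edges)
    while True:
        new = {(a, d) for (a, b), (c, d) in product(reach, reach) if b == c} - reach
        if not new:
            return reach
        reach |= new

def has_ccv_cycle(writes, po, rf):
    """writes maps each write event to its variable; po and rf are sets of pairs."""
    co = closure(po | rf)
    # w1 CF w2 whenever w1 CO r and w2 rf r, for distinct writes on the same variable
    cf = {(w1, w2) for (w2, r) in rf for w1 in writes
          if w1 != w2 and writes[w1] == writes[w2] and (w1, r) in co}
    return any(a == b for a, b in closure(co | cf))

# Fig. 1(b): pa performs w1 = wt(x,1); pb performs w2 = wt(x,2), r1, r2
writes = {"w1": "x", "w2": "x"}
po = {("w2", "r1"), ("r1", "r2")}
revised = has_ccv_cycle(writes, po, {("w1", "r1"), ("w2", "r2")})  # reads 1, then 2
stable = has_ccv_cycle(writes, po, {("w1", "r1"), ("w1", "r2")})   # reads 1 twice
```

Reading 1 and then 2 produces both `w1` CF `w2` and `w2` CF `w1`, while reading 1 twice only orders `w2` before `w1`.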


However, the program in Fig. 1(c) has a trace $\tau$ s.t. $extend_{CCv}(\tau)$
does not have a CCvcycle. In the corresponding run, we first allow $p_a$ to
complete execution, followed by $p_b$.


Causal Memory CM. The CM model is stronger than CC and incomparable to CCv. As
in CC, a process in CM can diverge from another one in its ordering of events
which are not causally related. However, once a process chooses an ordering of
such events, it cannot revise it; this is what makes CM stronger than CC.


A happened-before relation per process fixes the per-process ordering of events.
For a read/write event e in a trace, the causal past of e,
$CausalPast(e) = \{e' \mid e'\ CO\ e\}$, is the set of events which are in the
causal past of e. For an event e, the happened-before relation $HB_e$ [15] is
the smallest relation on events which is transitive, and is such that for all
events $e_1, e_2 \in CausalPast(e)$, $e_1$ CO $e_2$ $\Rightarrow$ $e_1$ $HB_e$
$e_2$. In other words, $CO|_{CausalPast(e)} \subseteq HB_e$: $HB_e$ contains all
pairs of events obtained by restricting CO to the events in the causal past of
e. For any variable x, if we have writes $w_1, w_2$ on x and a read
$r_2 = rd(x)$ such that
(i) $r_2 = e$ or $r_2$ po e, $w_2$ rf $r_2$, and $w_1$ $HB_e$ $r_2$, then $w_1$ $HB_e$ $w_2$, and
(ii) if $w_1$ $HB_e$ $w_2$ and $w_1$ rf $r_2$, then $r_2$ $HB_e$ $w_2$.


Let $e_p$ be the po-last event of process p: that is, for all events e in
process p, $e = e_p$ or e po $e_p$. Since $HB_e \subseteq HB_{e_p}$ for all
events e in process p, $HB_{e_p}$ fixes the ordering among all causally
unrelated events for process p. We write $HB_p$ instead of $HB_{e_p}$.


We define a function $extend_{CM}$ which extends a trace $\tau = (E, po, rf)$ by
adding all possible OW and $HB_p$ edges for all processes p. Traces violating CM
exhibit an $OW \cup HB_p$ cycle, called a CMcycle, in $extend_{CM}(\tau)$ for
some process p. We say that $\tau \models CM$ iff $extend_{CM}(\tau)$ does not
contain a CMcycle. Figure 3 motivates conditions (i) and (ii) for adding HB
edges so that $extend_{CM}(\tau)$ does not contain a CMcycle.


Examples. For the program in Fig. 1(c) and any trace $\tau$,
$extend_{CM}(\tau)$ contains a CMcycle. Consider the read event
$e_{p_b} = rd(x)$ with wt(x, 2) rf $e_{p_b}$. Then wt(x, 1) po wt(y, 1) rf
rd(y) po $e_{p_b}$, that is, wt(x, 1) CO $e_{p_b}$. This induces wt(x, 1)
$HB_{p_b}$ $e_{p_b}$, and wt(x, 1) $HB_{p_b}$ wt(x, 2). This results in
wt(z, 1) po wt(x, 1) $HB_{p_b}$ wt(x, 2) po r where r = rd(z) with wt(z, 0) rf
r. This gives wt(z, 1) $HB_{p_b}$ wt(z, 0), resulting in a cycle. However, the
program in Fig. 1(d) has a trace $\tau$ s.t. $extend_{CM}(\tau)$ does not
contain a CMcycle.

Fig. 3. Start with (a). In (b) we add the HB edge from rd(z) to wt(z,2) following
condition (ii). Then (c) is obtained on adding rd(x), wt(x,1) rf rd(x). In
contrast, (b)' does not follow condition (ii). Hence, when the rd(x) is added in
(c)', wt(x,2) is available to be read. Choosing wt(x,2) rf rd(x) necessitates
adding wt(x,1) HB wt(x,2) in (d)' by condition (i). This necessitates adding
wt(z,2) HB wt(z,1) in (e)', creating a CMcycle.


A run $\rho$ satisfies a model $X \in \{CC, CCv, CM\}$ if there exists a trace
$\tau$ such that $\rho \models \tau$ and $\tau \models X$. Define
$[\![\gamma]\!]_X := \{\tau_X \mid \exists \rho \in Runs(\gamma).\ \rho \models \tau_X \wedge \tau_X \models X\}$,
the set of traces generated under X from a given configuration $\gamma$.


Note. Similar to our characterization of bad traces using cycles, [15] uses bad
patterns in differentiated histories to capture violations of CC, CCv and CM.
Differentiated histories are posets labeled with wt(x, v) and
rd(x) $\triangleright$ v such that no two events $wt(x, v_1)$ and $wt(x, v_2)$
have $v_1 = v_2$. Bad patterns are characterized in [15] using the po and
read-from relations on differentiated histories. Since we work with traces
having po and rf, we do not require differentiated writes.


4 Trace Semantics 


To analyse a program P under a model $X \in \{CC, CCv, CM\}$, all runs of P must
be explored. We do this by exploring the associated traces. In fact, two runs
having the same associated traces are equivalent since the assertions to be
checked at the end of a run depend only on po and rf. We begin with the empty
trace, and continue exploration by adding enabled read/write events to the
traces generated so far. While doing this, we must ensure that the generated
traces $\tau$ are s.t. $\tau \models X$. We present two efficient operations to
add a new read/write event to a trace $\tau$, obtaining a trace $\tau'$ so that
$extend_X(\tau')$ does not contain an Xcycle. We discuss two notions that are
relevant while adding a new read event to a trace.


Readability and Visibility. For all three models, readability identifies the
write events w from which a newly added read r can fetch its value. Visibility
is used to add, in the case of CCv, new CF edges (and in the case of CM, new HB
edges) that are implied by the fact that the new read event reads from w. Let
$\tau = (E, po, rf)$ be a trace, and $\tau_X = extend_X(\tau)$. Let $\tau_X^r$
denote the result of adding r to $\tau_X$. We define the readable set
$readable(\tau_X^r, r, x)$ for a read event r from process p on variable x.


1. For X = CC, $readable(\tau_X^r, r, x)$ is defined as the set of all write
events $w \in E^{wt,x}$ s.t. there is no write $w' \in E^{wt,x}$ s.t. w CO w' CO
r in $\tau_X^r$.

2. For X = CCv, $readable(\tau_X^r, r, x)$ is defined as the set of all write
events $w \in E^{wt,x}$ s.t. there is no write $w' \in E^{wt,x}$ s.t. w
$(CO \cup CF)^+$ w' CO r in $\tau_X^r$.

3. For X = CM, $readable(\tau_X^r, r, x)$ is defined as the set of all write
events $w \in E^{wt,x}$ s.t. there is no write $w' \in E^{wt,x}$ s.t. w $HB_p$
w' $HB_p$ r in $\tau_X^r$.

Intuitively, $readable(\tau_X^r, r, x)$ contains all write events which are not
hidden in $\tau_X^r$ by other writes on x. The newly added read event r can
fetch its value from a write in $readable(\tau_X^r, r, x)$. The visible set
$visible(\tau_X^r, r, x)$ is defined as the set of events in
$readable(\tau_X^r, r, x)$ which can "reach" r in $\tau_X^r$. Let $\tau^{rw}$
denote the trace obtained by adding r and w rf r to trace $\tau$.


1. For X = CCv,
$visible(\tau_X^r, r, x) = \{e \in readable(\tau_X^r, r, x) \mid e\ CO\ r\}$.
The point of $visible(\tau_X^r, r, x)$ is that when the new read r is added,
which reads from a write w, $extend_X(\tau^{rw})$ will not contain an Xcycle on
adding from each $e \in visible(\tau_X^r, r, x)$ a CF edge to w. Then
$extend_X(\tau^{rw})$ contains $\{(e, w) \mid e \in visible(\tau_X^r, r, x)\}$.

2. For X = CM,
$visible(\tau_X^r, r, x) = \{e \in readable(\tau_X^r, r, x) \mid e\ HB_p\ r\}$
where r is a read in process p. The point of $visible(\tau_X^r, r, x)$ is that
when the new read r is added, which reads from a write w, $extend_X(\tau^{rw})$
will not contain an Xcycle on adding from each $e \in visible(\tau_X^r, r, x)$
an $HB_p$ edge to w. Then $extend_X(\tau^{rw})$ contains
$\{(e, w) \mid e \in visible(\tau_X^r, r, x)\}$.
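
The readable and visible sets can be sketched generically in Python. This is a minimal illustration, assuming relations are given as sets of pairs: the parameters `before` and `reach` are hypothetical names, instantiated as CO and CO for CC, $(CO \cup CF)^+$ and CO for CCv, and $HB_p$ and $HB_p$ for CM. The example replays the program of Fig. 1(b) under CCv with the event names used earlier.

```python
from itertools import product

def closure(edges):
    """Transitive closure of a set of (a, b) pairs."""
    reach = set(edges)
    while True:
        new = {(a, d) for (a, b), (c, d) in product(reach, reach) if b == c} - reach
        if not new:
            return reach
        reach |= new

def readable(x_writes, before, reach, r):
    # w is hidden if some write w2 on x satisfies: w `before` w2 and w2 `reach` r
    return {w for w in x_writes
            if not any((w, w2) in before and (w2, r) in reach for w2 in x_writes)}

def visible(x_writes, before, reach, r):
    return {e for e in readable(x_writes, before, reach, r) if (e, r) in reach}

x_writes = {"w1", "w2"}

# step 1: pb has performed w2 and now adds r1
co = closure({("w2", "r1")})
readable_r1 = readable(x_writes, co, co, "r1")
visible_r1 = visible(x_writes, co, co, "r1")

# r1 reads from w1, so a CF edge is added from each visible write to w1
cf = {(e, "w1") for e in visible_r1 if e != "w1"}

# step 2: pb adds r2 after r1
co = closure({("w2", "r1"), ("r1", "r2"), ("w1", "r1")})
readable_r2_ccv = readable(x_writes, closure(co | cf), co, "r2")
readable_r2_cc = readable(x_writes, co, co, "r2")
```

After `r1` reads 1, the CF edge from `w2` to `w1` hides `w2` from `r2` under CCv, whereas under CC both writes remain readable, which matches the classification of Fig. 1(b) as CC but not CCv.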


The trace semantics for a model $X \in \{CC, CCv, CM\}$ is given as the
transition relation $\rightarrow_{X\text{-}tr}$, defined as
$\tau_X \xrightarrow{\alpha}_{X\text{-}tr} \tau'_X$ where
$extend_X(\tau) = \tau_X$ and $extend_X(\tau') = \tau'_X$. The label $\alpha$ is
one of (read, r, w) and (write, w), representing respectively a read r reading
from a write w, and a write event w. An important property of
$\tau_X \xrightarrow{\alpha}_{X\text{-}tr} \tau'_X$ is that if $\tau_X$ does not
have an Xcycle, then $\tau'_X$ also does not have an Xcycle; in other words, if
$\tau \models X$, then $\tau' \models X$. We now describe the transitions
$\tau_X \xrightarrow{\alpha}_{X\text{-}tr} \tau'_X$ where
$extend_X(\tau) = \tau_X$, $extend_X(\tau') = \tau'_X$, $\tau = (E, po, rf)$ and
$\tau' = (E', po', rf')$. We start from the empty trace $\tau_\emptyset$, with
$extend_X(\tau_\emptyset) = \tau_\emptyset$.


- From $\tau_X$, assume that we observe a write w in process p. In this case,
the label $\alpha$ is (write, w), and we add w, and a po edge from the po-latest
event of process p in $\tau_X$ to w, obtaining $\tau'_X$.

- From $\tau_X$, assume we observe a read event r on variable x in process p. In
this case, the label $\alpha$ is (read, r, w), where w is the write from which r
reads. Add the read r and a po edge from the po-latest event of process p in
$\tau_X$ to r, obtaining $\tau_X^r$. Add rf from a
$w \in readable(\tau_X^r, r, x)$ to r.

• When X = CCv: add new CF edges from all $w'' \in visible(\tau_X^r, r, x)$ to w
to get $\tau'_X$.

• When X = CM: add new $HB_p$ edges from all $w'' \in visible(\tau_X^r, r, x)$
to w. Adding these $HB_p$ edges can result in $w_1$ $HB_p$ $w_2$ for write
events $w_1, w_2$ on a variable y. If we had $w_1$ rf $r_1$ and $r_1$ po r, then
add $r_1$ $HB_p$ $w_2$. When we are done adding all such $HB_p$ edges, we obtain
$\tau'_X$ (Figure 4).
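
For the simplest case X = CC, where no CF or HB edges are needed, the two transitions can be sketched as a small Python class. This is an illustrative sketch only: the class `CCTrace`, its method names and the event names are hypothetical, initializer events are omitted, and CO is recomputed naively at each step.

```python
from itertools import product

def closure(edges):
    """Transitive closure of a set of (a, b) pairs."""
    reach = set(edges)
    while True:
        new = {(a, d) for (a, b), (c, d) in product(reach, reach) if b == c} - reach
        if not new:
            return reach
        reach |= new

class CCTrace:
    def __init__(self):
        self.var = {}                 # write event -> variable
        self.po, self.rf = set(), set()
        self.last = {}                # process -> its po-latest event

    def _append(self, p, e):
        if p in self.last:            # po edge from the po-latest event of p
            self.po.add((self.last[p], e))
        self.last[p] = e

    def add_write(self, p, w, x):
        self._append(p, w)
        self.var[w] = x

    def readable(self, p, r, x):
        # CO of the intermediate trace, i.e. with r and its po edge added
        po = self.po | ({(self.last[p], r)} if p in self.last else set())
        co = closure(po | self.rf)
        ws = [w for w, v in self.var.items() if v == x]
        return {w for w in ws
                if not any((w, w2) in co and (w2, r) in co for w2 in ws)}

    def add_read(self, p, r, x, w):
        if w not in self.readable(p, r, x):
            raise ValueError(f"{w} is hidden from {r}")
        self._append(p, r)
        self.rf.add((w, r))

# replay Fig. 1(a): pa writes x then y; pb reads y, writes x, reads x
t = CCTrace()
t.add_write("pa", "wx1", "x")
t.add_write("pa", "wy1", "y")
t.add_read("pb", "ry", "y", "wy1")
t.add_write("pb", "wx2", "x")
options = t.readable("pb", "rx", "x")
```

Because a read may only choose a readable write, the stale write `wx1` is never offered to the final read, so the CCcycle of Fig. 1(a) cannot be generated by these transitions.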


Lemma 1. If $\tau_X = extend_X(\tau)$ with $\tau \models X$, and
$\tau_X \xrightarrow{\alpha}_{X\text{-}tr} \tau'_X = extend_X(\tau')$, then
$\tau' \models X$ for $X \in \{CC, CCv, CM\}$.


Efficiency and Correctness. Each step of $\xrightarrow{\alpha}_{X\text{-}tr}$ is
computable in polynomial time. This is based on the fact that readable and
visible sets are computable in polynomial time. The correctness of the trace
semantics for a model X stems from the fact that it generates only those
X-extensions which do not have cycles (Lemma 1). The transitions ensure
acyclicity of the resultant extended traces.

Fig. 4. $w_1, w_6$ are writes on y, $r_1$ is a read on y, and
$w_2, w_3, w_4, w_5$ are writes on x in $\tau_{CM}$. Add read r on x to
$\tau_{CM}$. $w_2, w_3, w_5 \in readable(\tau_{CM}^r, r, x)$. Choose $w_5$ rf r.
Then we add $w_2$ HB $w_5$ and $w_3$ HB $w_5$. The addition of $w_2$ HB $w_5$
results in $w_1$ HB $w_6$. Since $w_1$ rf $r_1$, add the HB edge from $r_1$ to
$w_6$ to obtain $\tau'_{CM}$.


Algorithm 1: EXPLORETRACES(X, $\tau_X$, $\pi$)

Input: $X \in \{CC, CCv, CM\}$ is a consistency model, $\tau_X$ is an
X-extension, $\pi$ is an observation sequence.

1  if $\exists w$ s.t. $\tau_X \xrightarrow{(write, w)}_{X\text{-}tr} \tau'_X$ then   // handle a write event
2      let w = wt(x, v) and perform $\tau_X \xrightarrow{(write, w)}_{X\text{-}tr} \tau'_X$   // follow trace semantics write
3      EXPLORETRACES(X, $\tau'_X$, $\pi \cdot w$)
4      CREATESCHEDULE(X, $\tau'_X$, $\pi \cdot w$)
5  else if $\exists r$ s.t. $\tau_X \xrightarrow{(read, r, -)}_{X\text{-}tr} \tau'_X$ then   // handle a read event
6      Schedules(r) $\leftarrow \emptyset$; Swappable(r) $\leftarrow$ true
7      for w, $\tau'_X$ : $\tau_X \xrightarrow{(read, r, w)}_{X\text{-}tr} \tau'_X$ do EXPLORETRACES(X, $\tau'_X$, $\pi \cdot (r, w)$)   // follow trace semantics read
8      for $\beta \in$ Schedules(r) do RUNSCHEDULE(X, $\tau_X$, $\pi$, $\beta$)


5 DPOR Algorithm for CC, CCv, CM 


We present our DPOR algorithm, which systematically explores, for any
terminating program under the consistency models $X \in \{CC, CCv, CM\}$, all
traces $\tau_X$ wrt X which can be generated by the trace semantics. Enabled
write events from any of the processes are added to the trace generated so far,
and we proceed with the next event. For a read event r, we add r to the trace,
and explore in separate branches all possible write events w from which r can
read. Each such branch is a sequence of events, also called a schedule. There
may be writes w' which will be added to the trace later in the exploration, from
which r can also read. Such writes w' are called postponed wrt r; when w' is
added to the trace later, the algorithm will have a branch where r can read
from w'. In that branch, the algorithm reorders events in the sequence s.t. w'
and r exchange places, and all events which are needed for w' to occur are also
placed before w' (CREATESCHEDULE). All generated schedules will be executed by
RUNSCHEDULE. The algorithm is uniform across the models, with the main technical
differences being taken care of by the respective trace semantics which guides
the exploration of traces in each model.


Algorithm 2: CREATESCHEDULE(X, $\tau_X$, $\pi$)

Input: $X \in \{CC, CCv, CM\}$ is a consistency model, $\tau_X$ is an
X-extension and $\pi$ is an explored observation sequence.

1  let w be last($\pi$) and x be var(w)
2  for $i \leftarrow |\pi| - 1$ to 1 do   // look for reads r that have postponed w
3      let r be the element at $\pi[i]$
4      if r is a read on x $\wedge\ \neg$(r CO w) $\wedge$ Swappable(r) then
5          $\beta \leftarrow \varepsilon$; flag = true;
6          for $j \leftarrow i + 1$ to $|\pi| - 1$ do   // get all events after r in $\pi$ that precede w in CO
7              let ev be the element at $\pi[j]$
8              if ev CO w then
9                  if r CO ev then
10                     flag = false; break;
11                 else
12                     $\beta \leftarrow \beta \cdot \pi[j]$
13         if flag $\wedge\ \nexists \beta' \in$ Schedules(r). $\beta' \sim \beta \cdot w \cdot (r, w)$ then
14             Schedules(r) $\leftarrow$ Schedules(r) $\cup\ \{\beta \cdot w \cdot (r, w)\}$   // r can read from w


Algorithm 3: RUNSCHEDULE(X, $\tau_X$, $\pi$, $\beta$)

Input: $X \in \{CC, CCv, CM\}$ is a consistency model, $\tau_X$ is an
X-extension, $\pi$ is an explored observation sequence, and $\beta$ is a
schedule.

1  if $\beta \neq \varepsilon$ then   // explore the sequence of observations one by one
2      let $\beta$ be $\alpha \cdot \beta'$; choose $\tau'_X$ : $\tau_X \xrightarrow{\alpha}_{X\text{-}tr} \tau'_X$   // follow write and read
3      if $\alpha$ = (read, r, w) then Swappable(r) $\leftarrow$ false
4      RUNSCHEDULE(X, $\tau'_X$, $\pi \cdot \alpha$, $\beta'$)
5  else EXPLORETRACES(X, $\tau_X$, $\pi$)

The EXPLORETRACES Algorithm. This algorithm takes as input a consistency model
$X \in \{CC, CCv, CM\}$, an X-extension $\tau_X$ and an observation sequence
$\pi$. $\pi$ is a sequence of events of the form (write, w) or (read, r, w). The
initial invocation is with the empty trace $\tau_\emptyset$ and observation
sequence $\pi = \varepsilon$. The observation sequence is used to swap read
operations with write operations that are postponed wrt them. From the initial
$\tau_\emptyset$, we choose an operation from any of the processes. If a write
operation is enabled, one such is chosen non-deterministically from any process,
and is added to the trace according to the trace semantics, and also appended to
the observation sequence, whereafter EXPLORETRACES is called recursively to
continue the exploration (line 3). After the recursive calls have returned, the
algorithm calls CREATESCHEDULE, which finds read operations r in the observation
sequence which can read from write operations w if w was performed before r. For
each such read r, CREATESCHEDULE creates a schedule for r, an observation
sequence that can be explored from the point when r was performed, allowing w to
occur before r so that r can read from w. When a read operation r is enabled,
the set Schedules(r) is initialized (line 6). This set is updated by
CREATESCHEDULE when subsequent writes are explored. We also keep a Boolean flag
Swappable(r) for each read event r. This is initialized to true, indicating that
r is swappable, that is, subsequent writes can be considered for r. This flag is
set to false for read events appearing in a schedule so that they are not
swapped, eliminating redundant explorations. For each generated write event w
from which r can read, EXPLORETRACES is called recursively (line 7) to continue
the exploration. Once these recursive calls have returned, the set of schedules
collected in Schedules(r) for the read r is considered. RUNSCHEDULE explores all
schedules, where the read fetches its value from the respective write.


The CREATESCHEDULE Algorithm. The input to this algorithm is a consistency
model X, a trace $\tau_X$ wrt X, and an observation sequence $\pi$ whose last
element is a write w. The algorithm looks for reads in $\pi$ for which w is a
postponed write. Such a read r and w must be on the same variable, r must be
swappable, and r must not precede w wrt CO (line 4). We begin with the closest
(from the write w) such read r at position $\pi[i]$. After finding r, a schedule
$\beta$ is created. The schedule consists of all elements following r in $\pi$
and preceding w wrt CO (line 12). It ends with $w \cdot (r, w)$, allowing r to
read from w (line 13). This schedule is added to Schedules(r) if Schedules(r)
does not already contain a schedule $\beta'$ which has the same set of
observations, i.e., $\beta' \sim \beta \cdot w \cdot (r, w)$.
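
The schedule construction can be sketched in Python as follows. This is a hedged illustration of Algorithm 2, not the tool's code: observations are assumed to be tuples `("write", w)` or `("read", r, w)`, CO is passed as a precomputed set of pairs, and `var`, `swappable` and `schedules` are plain dictionaries introduced for the example.

```python
def create_schedule(pi, co, var, swappable, schedules):
    """pi: observation sequence whose last element is ('write', w)."""
    w = pi[-1][1]
    for i in range(len(pi) - 2, -1, -1):          # scan backwards, closest read first
        obs = pi[i]
        if obs[0] != "read":
            continue
        r = obs[1]
        if var[r] != var[w] or (r, w) in co or not swappable[r]:
            continue
        beta, ok = [], True
        for later in pi[i + 1:-1]:                # observations strictly between r and w
            ev = later[1]                         # the event of a write or read observation
            if (ev, w) in co:
                if (r, ev) in co:
                    ok = False
                    break
                beta.append(later)                # needed for w to occur
        new = beta + [("write", w), ("read", r, w)]
        # add only if no stored schedule has the same set of observations
        if ok and all(set(old) != set(new) for old in schedules[r]):
            schedules[r].append(new)

# pb reads x from the initializer; pa then writes y (event u) and x (event w)
pi = [("read", "r", "init_x"), ("write", "u"), ("write", "w")]
co = {("init_x", "r"), ("u", "w")}
var = {"r": "x", "u": "y", "w": "x"}
swappable = {"r": True}
schedules = {"r": []}
create_schedule(pi, co, var, swappable, schedules)
create_schedule(pi, co, var, swappable, schedules)   # the second call adds nothing new
```

In this example the write `u` precedes `w` in program order, so it is carried along in the schedule before `w`, after which `r` reads from `w`.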


The RUNSCHEDULE Algorithm. The inputs are a consistency model X, a trace
$\tau_X$, an observation sequence $\pi$ and a schedule $\beta$. The observations
in $\beta$ are explored one by one, by recursively calling RUNSCHEDULE and
updating the trace. The read events in the schedule are not swappable,
preventing a redundant exploration for them (schedules where these are swapped
with respective writes will be created by CREATESCHEDULE). All proofs and an
illustrative example can be found in the extended version of the paper [2].


Theorem 1. Our DPOR algorithms are sound, complete and optimal. 


Soundness, Optimality and Completeness. The algorithm is sound in the sense
that, if we initiate Algorithm 1 from $(X, \tau_\emptyset, \varepsilon)$, then
all explored traces $\tau$ are s.t. $\tau \models X$. This follows from the fact
that the exploration uses the $\rightarrow_{X\text{-}tr}$ relation. The
algorithm is optimal in the sense that, for any two different recursive calls to
Algorithm 1 with arguments $(X, \tau_X^1, \pi_1)$ and $(X, \tau_X^2, \pi_2)$, if
$\tau_X^1, \tau_X^2$ are extendible, then $\tau_X^1 \neq \tau_X^2$. This follows
from the following facts: (i) for a given read r, each iteration of the for loop
in line 7 will correspond to a different write; (ii) in each schedule
$\beta \in$ Schedules(r) in line 8 of Algorithm 1, the read event r reads from a
write w which is different from all writes it reads from in line 7; (iii) any
two schedules added to Schedules(r) at line 14 of Algorithm 2 will be different.
The algorithm explores traces of all terminating runs, and is hence complete.


6 Experimental Evaluation


We describe the implementation of our optimal DPOR algorithm for the causal
consistency models CC, CCv, CM as a tool CONSCHECKER, available at [45]. To the
best of our knowledge, CONSCHECKER is the first stateless model checking tool
for the causal consistency models CC, CCv, CM.

CONSCHECKER. CONSCHECKER extends NIDHUGG [3] and works at the LLVM IR level,
accepting a C language program as input. At runtime, CONSCHECKER controls the
exploration of the input program until it has explored all the traces using the
DPOR algorithm. It can detect user-provided assertion violations by analyzing
the generated traces. We conduct all experiments on a machine running Ubuntu
22.04.1 LTS with an Intel Core i7-1165G7 and 16 GB RAM.

Experimental Setup. We consider the following categories of benchmarks.

• A set of thousands of litmus tests (Sec. 6.1) generated from [8]. The main
purpose of these experiments is to provide a sanity check of the correctness of
CONSCHECKER on all three consistency models.

• A collection (Sec. 6.2) of concurrent benchmarks taken from the TACAS
competition on software verification [44]. These are small programs with 50-100
lines of code used by many tools [4], [5].

• Five applications (Sec. 6.3): Voter [19], Twitter clone [27], Fusion ticket
[27], and two versions of Auction [36], extracted from the literature on
databases, which we verify against assertion violations wrt the three
consistency models.

• Classical database benchmarks (Sec. 6.4) reported in recent papers on
consistency models [13], [12] and [14]. We classify these benchmarks as SAFE or
UNSAFE on all three models depending on whether they witness an assertion
violation.

• Eight parameterized programs (Sec. 6.5) from [5] and [4] to study the
scalability of CONSCHECKER when increasing the number of processes, as well as
read and write instructions in programs.


6.1 Litmus Benchmarks 


We apply CONSCHECKER to a set of 9815 litmus benchmarks generated from [8].
Litmus tests are standard benchmark programs used by many tools running on
weak memories. In these litmus tests, the processes execute concurrently, and
we validate assertions on the underlying memory model, doing a sanity check
for the correctness of CONSCHECKER. We compared the observed outcomes of
CONSCHECKER on the litmus tests with expected outcomes generated from [8].
We generated the expected outcomes by simulating the CC, CCv and CM semantics
on [8] for these litmus tests. Out of the 9815 litmus tests, we found no
assertion violations in 3810 under CC and CM, and in 3811 under CCv. Results
obtained from CONSCHECKER matched the expected outcomes. CONSCHECKER took less
than 3 minutes to execute on all litmus tests across models.


Table 1. Classical benchmarks

Program                   CCv      CC       CM
Causality Violation [13]  UNSAFE   UNSAFE   UNSAFE
Causal Violation [14]     SAFE     SAFE     SAFE
Delivery Order [12]       UNSAFE   UNSAFE   SAFE
Long Fork [13]            UNSAFE   UNSAFE   UNSAFE
Lost Update [13]          UNSAFE   UNSAFE   UNSAFE
Message Passing [11]      SAFE     SAFE     SAFE
Conflict violation [14]   UNSAFE   UNSAFE   UNSAFE
Read Atomicity [14]       UNSAFE   UNSAFE   UNSAFE
Repeated Read [14]        UNSAFE   UNSAFE   UNSAFE
Load Buffer               SAFE     SAFE     SAFE
Store Buffer [11]         UNSAFE   UNSAFE   UNSAFE
Write Skew [13]           UNSAFE   UNSAFE   UNSAFE


Table 2. Applications

App                 Time
Voter [19]          1s
Twitter clone [27]  0.09s
FusionTicket [27]   0.75s
Auction [36]        0.11s
Auction-2 [36]      1.17s
Group               0.10s


Table 3. SV-COMP Benchmarks

            CCv               CC                CM
Program     Traces    Time    Traces    Time    Traces    Time
Lamport     15669     3s      2904225   490s    299028    110s
Szymanski   1023397   131s    1023397   115s    1023397   190s
Peterson    5371      1s      13483     1s      12316     1.5s
Fibonacci   6224342   769s    6224342   695s    6224342   1796s
Dekker      86267     7s      1549862   155s    107698    18s


6.2 SV-COMP Benchmarks 


These benchmarks [44] consist of five programs written in C/C++ having 2
processes each, with 50-100 lines of code per process (Table 3). The main
challenge in these benchmarks is the large number of traces to be explored.
These benchmarks have assertion checks, and under CC, CCv and CM all these
assertions are violated. CONSCHECKER stops exploration as soon as it detects the
first assertion violation. To check the efficiency of CONSCHECKER, we removed
all assertions and let CONSCHECKER exhaustively explore all po-rf traces. Since
these benchmarks have a large number of traces, they serve as a stress test.


6.3 Database Applications 


Table 2 reports the performance of CONSCHECKER on a set of programs inspired
by five applications extracted from the literature on distributed systems [19],
[27], [12], [36]. The applications we considered are:

• Voter [19]: This application is derived from a software system used to record
votes from a talent show. Users can vote for any of the n contestants from any
one of the m sites (processes). The application asserts that users cannot vote
from multiple sites and cannot vote for multiple contestants, and checks for
violations of this. [19] considers 3 sites and 3 users, and we follow suit.

• Twitter clone [27]: This is based on a Twitter-like service where each user
has some followers. The following assertion is checked: when the user tweets,
the tweet ID must be added to the follower's timeline exactly once if the user
did not remove the tweet. We considered 3 users using 3 processes; each process
has 10 tweet IDs and 6 followers.

• Fusion ticket [27]: There is a building having multiple concert rooms
(venues). Tickets for venue i are sold by salesperson i, who updates the sales
for the day in the backend database. The per-venue ticket sale must be updated
correctly in the database, so that the concert manager sees the correct total
number of tickets sold. A discrepancy in this number is a violation. Each venue
is represented by a process, and the communication across processes ensures the
total sum is correct. We considered 4 venues and each venue had 10 tickets.

• Auction [36] and Auction-2 [36]: There are n bidders and an auctioneer
participating in an auction, modeled using n+1 processes. The assertion to be
checked is that the highest bidder must be declared winner. Auction is the buggy
version of this application, while Auction-2 is the correct one.

• Group is a synthetic application created by us, inspired by WhatsApp groups.
There is a group with n members, and a new person wants to be added to the
group. This person must be added to the group only by one of the existing
members. That is, a violation consists of adding a person more than once (by
one or more members). We check with 6 processes (members).


Table 4. Parameterized Benchmarks from [5] and [4]

                      CC                CCv               CM
Program               Traces   Time     Traces   Time     Traces   Time
control-flow(6)       77       0.05s    77       0.05s    77       0.05s
control-flow(8)       273      0.07s    273      0.06s    273      0.10s
control-flow(10)      1045     0.16s    1045     0.12s    1045     0.33s
control-flow(12)      4121     0.60s    4121     0.45s    4121     1.80s
n-writers-a-read(5)   6        0.05s    6        0.05s    6        0.05s
n-writers-a-read(10)  11       0.05s    11       0.05s    11       0.05s
n-writers-a-read(15)  16       0.05s    16       0.05s    16       0.05s
n-writers-a-read(20)  21       0.05s    21       0.05s    21       0.05s
redundant-co(5)       91       0.07s    91       0.05s    91       0.05s
redundant-co(10)      331      0.09s    331      0.05s    331      0.08s
redundant-co(15)      721      0.11s    721      0.08s    721      0.12s
redundant-co(20)      1261     0.18s    1261     0.13s    1261     0.20s
casrot(9)             8579     0.55s    8597     0.77s    8597     2s
casrot(10)            38486    2.50s    38486    3.16s    38486    9s
casrot(11)            182905   14s      182905   16s      182905   49s
floating-read(9)      10       0.05s    10       0.05s    10       0.05s
floating-read(11)     12       0.05s    12       0.05s    12       0.05s
floating-read(13)     14       0.05s    14       0.05s    14       0.05s
lastwrite(9)          9        0.04s    9        0.04s    9        0.04s
lastwrite(11)         11       0.04s    11       0.04s    11       0.04s
lastwrite(13)         13       0.04s    13       0.04s    13       0.04s
lastzero(9)           1536     0.18s    1536     0.20s    1536     0.33s
lastzero(11)          7168     1s       7168     1s       7168     2s
lastzero(13)          32768    5s       32768    5s       32768    12s
readers(9)            512      0.10s    512      0.10s    512      0.18s
readers(11)           2048     0.40s    2048     0.35s    2048     1s
readers(13)           8192     1.5s     8192     1.5s     8192     6s


6.4 Classical Benchmarks 


Table 1 consists of classical benchmarks [13], [14], [12] and [11] which test for
some assertion violations under the three models. Since the traces generated
differ for each model $X \in \{CC, CCv, CM\}$, the violations also differ. For
the ones marked SAFE under model $X \in \{CC, CCv, CM\}$, the assertion
violation did not occur under any execution, while the UNSAFE ones reported the
violation. We consider twenty such examples.

For each example, we have three versions obtained by parameterizing the number
of processes and instructions. In version 1, we have four processes per program
and three to five instructions per process. Version 2 is obtained by allowing
each process to have seven to ten instructions. Version 3 expands version 2 by
allowing each program to have up to five to six processes and up to 15 to 20
instructions. The number of instructions is increased by introducing fresh
variables and having reads/writes on them. Versions 2 and 3 serve as a stress
test for CONSCHECKER, as increasing the number of instructions and processes
increases the number of consistent traces. CONSCHECKER took less than 3s to
finish running all version 1 programs, about 30s to finish running all version 2
programs and about 200s to finish running all version 3 programs.


6.5 Parameterized Benchmarks 


Table 4 reports experimental results of CONSCHECKER on 8 parameterized bench- 
marks. Out of these, in redundant-co(N) (taken from [5]), N is the number of 
loop iterations per process in a program with 3 processes. In all others, the 
parameterization is on the number of processes. This set of benchmarks serves 
to check the scalability of CONSCHECKER. As seen in Table 4, CONSCHECKER 
scales up to 20 processes (n-writers-a-read) and 13 variables (lastzero). 


7 Conclusion 


In this paper, we have provided a DPOR algorithm using the po ∪ rf equivalence
for three prominent causal consistency models, and implemented it in a tool,
CONSCHECKER. This is the first tool for stateless model checking of
causal consistency models. We plan to extend our work by developing a DPOR
algorithm for transactional programs under CC, CCv, CM [12]. For these, the extra
complication is the presence of transactions, which must be executed atomically
without interference in each process. The final step is to handle snapshot
isolation, the strongest among transactional consistency models.

Data-Availability Statement


The tool and experimental data for the study are available at the Zenodo
repository: [45].


Abstract. The need to provide formal guarantees about the behaviour
of the algorithms underpinning modern distributed systems became
evident in recent years. This interest made apparent the complexities
involved in applying verification techniques in a distributed setting, with
significant effort being made in both academia and industry to aid in
this endeavour. Many formalisms have been proposed to tackle the
difficulties faced by practitioners, with one that has seen widespread use in
industry being TLA+, adopted, for instance, by Amazon Web Services.
TLA+ provides engineers with a way of specifying both systems and
desired properties, and is supported by a number of verification tools.
Despite their extensive use, such tools suffer considerably from lack of
scalability. To solve this, we propose a novel encoding of TLA+ into SMT
constraints to improve symbolic model checking efficiency. Our insight is
the need to provide the SMT solver with structural information about
the TLA+ specification encoded, i.e., how data structures and their
component elements interact, which we do by relying on the SMT theory
of arrays. We implemented our approach by modifying the SMT-based
model checker APALACHE and evaluated it against comparable tools. Our
results show that our approach outperforms existing ones on a number of
benchmarks, with an order of magnitude improvement in checking time.


Keywords: Model checking - SMT arrays - Distributed algorithms 


1 Introduction 


Distributed systems are ubiquitous in the modern world, with many companies
directly relying on them to conduct business. Due to this, the ability to ensure
that a distributed system is operating correctly is paramount. The search for
correctness guarantees led to an influx of interested parties adopting formal
verification methodologies in recent years. One of the most famous examples of
this trend is probably the adoption of TLA+ [17] by Amazon Web Services [19].
TLA+ is a specification language based on the temporal logic of actions (TLA)
which allows users to describe the expected behaviour of a system, while
abstracting away implementation details that do not impact high-level
properties, e.g., memory management. With TLA+ specifications at hand, Amazon
engineers rely on model checking for correctness guarantees of systems such as
DynamoDB [23].

Despite recent interest and advances, the verification of distributed systems 
remains notoriously difficult. This is mainly due to the fact that, given their 
distributed nature, distributed algorithms’ executions admit numerous potential 
interleavings of steps, with state-spaces generally growing exponentially with 
the number of participants. In the case of TLA+, a handful of tools are
available to aid in verification [14]. TLC [27] is an explicit-state model checker that
enumerates all reachable states of the given system. APALACHE [13] is a sym- 
bolic bounded model checker that uses a satisfiability modulo theories (SMT) 
encoding of states in order to better tackle the state-space explosion problem. 
TLAPS [6] is an interactive proof system that enables the proving of properties 
without the need of exploring the state-space itself. Despite providing the ben- 
efit of verifying specifications with infinite state-spaces, and efforts being made 
towards partial automation [18], TLAPS adoption is still slow, with engineers 
favouring the push-button automation provided by model checkers. 

In this work we focus on symbolic model checking for TLA+, as spearheaded
by the SMT encoding which underpins APALACHE, but provide insights into
SMT-based model checking that may generalise to other contexts. The encoding
of TLA+ into SMT done by APALACHE removes all structural information
present in the encoded specification, with all TLA+ data structures being
represented via uninterpreted constants in the generated SMT formula. The
information not forwarded to the SMT solver has the potential to significantly
improve solving efficiency. We propose an alternative SMT encoding that makes
full use of the SMT theory of arrays [8] to encode the main TLA+ data
structures, i.e., sets and functions, with the goal of improving solving
performance, which is the determining factor in overall model checking
performance.

Concretely, we modify APALACHE’s abstract reduction system (ARS) to generate
constraints in the SMT theory of arrays, while relying on its preprocessing
infrastructure, as shown in Figure 1. APALACHE rewrites the input specification
into the KerA+ verification-friendly fragment of TLA+ [13] and then applies
ARS rules to generate the SMT formula to be solved. We implemented our
encoding in APALACHE and compared it with APALACHE’s constants encoding and
TLC. Our experiments indicate that embedding structural information into the
SMT formulas has a significant impact on performance. Our contributions are:


1. Formalisation of a TLA+ encoding into the SMT theory of arrays;
2. Development of a robust open-source implementation of our encoding; 
3. Evaluation via checking agreement on three asynchronous protocols. 


The paper is structured as follows: background is given in Section 2, the
arrays-based encoding and its evaluation are presented in Sections 3 and 4,
related work is discussed in Section 5, and our final remarks are made in Section 6.


Fig. 1: Overview of the symbolic model checking for TLA+. The dotted box
highlights the identification of symbolic transitions from [16] and the rewriting into
KerA+. The dashed box highlights the encoding based on uninterpreted
constants from [13]. The solid box highlights the arrays-based encoding we propose.


2 Background 


In this section we introduce the basics of TLA+, its KerA+ fragment used to
represent TLA+’s core, the approach to generate SMT constraints from KerA+
via abstract reduction, and finally the SMT theory of arrays.


2.1 TLA+ 


We introduce TLA+ via a specification of the asynchronous Byzantine agreement
protocol by Bracha and Toueg [5], shown in Figure 2. Here we focus on the most
relevant TLA+ constructs, with further details being available in [17].

The first notable aspect of TLA+ is that specifications may be parametrised,
e.g., the number of processes and faults may not be fixed. In our example, the 
keyword CONSTANTS, in line 3, is used to declare its parameters: N, the total 
number of processes, and T and F, the maximal and actual number of faulty 
processes. It is important to understand, however, that while a specification 
may be parametrised, model checking can only be carried out for a specific 
instance of the protocol at a time, e.g., N = 4 and T = F = 1. Parameter 
declarations are followed by variable declarations, by the use of the VARIABLES 
keyword, in line 4. Variables define the states of the state-machine that the 
specification describes, with each state being defined by the combination of the 
values held by each variable. In our example, each state is defined by the values 
of sentEcho, sentReady, rcvdEcho, rcvdReady, and pc.

The remaining TLA+ operators describe state-machine transitions or
properties to be checked, and are defined using ≜. Two operators are of special
significance, one that defines the initial-state predicate and one that plays the role
of the transition operator. In our example, these operators are Init, in line 8,
and Next, in line 22. Concretely, Init defines the starting point for state-space
exploration and Next defines the exploration itself. Transitions are guided by
constraints that must hold in both pre-transition states, represented by
non-primed variables, and post-transition states, represented by primed variables.


 1  ---------------------------- MODULE ABA ----------------------------
 2  EXTENDS Integers, FiniteSets
 3  CONSTANTS N, T, F
 4  VARIABLES sentEcho, sentReady, rcvdEcho, rcvdReady, pc
 5  Corr ≜ 1 .. (N − F)                  The set of correct processes
 6  Byz ≜ (N − F + 1) .. N               The set of Byzantine processes
 7  Proc ≜ 1 .. N                        The set of all processes
 8  Init ≜ ∧ pc ∈ [Corr → {“V0”, “V1”}]
 9         ∧ rcvdEcho = [p ∈ Corr ↦ {}] ∧ rcvdReady = [p ∈ Corr ↦ {}]
10         ∧ sentEcho ∈ SUBSET Byz ∧ sentReady ∈ SUBSET Byz
11  Receive(p, nextEcho, nextReady) ≜ ...     Omitted for brevity
12  SendEcho(p, nextEcho, nextReady) ≜ ...    Omitted for brevity
13  SendReady(p, nextEcho, nextReady) ≜
14         ∧ pc[p] = “EC”
15         ∧ ∨ Cardinality(nextEcho) ≥ (N + T + 2) ÷ 2
16           ∨ Cardinality(nextReady) ≥ T + 1
17         ∧ pc′ = [pc EXCEPT ![p] = “RD”] ∧ sentReady′ = sentReady ∪ {p}
18         ∧ UNCHANGED sentEcho
19  Decide(p, nextReady) ≜
20         ∧ pc[p] = “RD” ∧ Cardinality(nextReady) ≥ 2 * T + 1
21         ∧ pc′ = [pc EXCEPT ![p] = “AC”] ∧ UNCHANGED ⟨sentEcho, sentReady⟩
22  Next ≜ ∃ p ∈ Corr, nextEcho ∈ SUBSET sentEcho, nextReady ∈ SUBSET sentReady :
23         ∧ Receive(p, nextEcho, nextReady)
24         ∧ ∨ SendEcho(p, nextEcho, nextReady) ∨ SendReady(p, nextEcho, nextReady)
25           ∨ Decide(p, nextReady) ∨ UNCHANGED ⟨pc, sentEcho, sentReady⟩
26  NoDecide ≜ ∀ p ∈ Corr : pc[p] ≠ “AC”     Invariant stating that processes never Decide
27  ====================================================================

Fig. 2: Example of a TLA+ specification, based on the asynchronous Byzantine
agreement protocol by Bracha and Toueg [5]; simplifications made for brevity.


Specifications may optionally define invariants, i.e., properties that should 
hold in every reachable state. There is no special syntax for invariants, and they 
are provided by name to model checkers at invocation time. In our example, we 
have one invariant, NoDecide, in line 26. A specification satisfies NoDecide if no 
state reachable from Init via any number of Next transitions has pc[p] = “AC”,
for some p ∈ Corr. Abstractly, this invariant holds iff Decide can never be taken.
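
To make the role of the initial-state predicate, the transition operator, and an invariant concrete, the following is a minimal explicit-state sketch in Python. It is not the ABA protocol and not how any TLA+ tool is implemented: the toy system, and the names `check_invariant`, `toy_init`, `toy_next`, and `no_decide`, are assumptions introduced only for illustration.

```python
from collections import deque


def check_invariant(init_states, next_states, invariant):
    """Breadth-first search over reachable states.

    Returns a reachable state violating the invariant, or None if none exists.
    """
    seen = set(init_states)
    queue = deque(init_states)
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state
        for succ in next_states(state):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return None


# Hypothetical toy system: a single process whose program counter
# moves "V0" -> "EC" -> "RD" -> "AC".
ORDER = ["V0", "EC", "RD", "AC"]


def toy_init():
    return [("V0",)]


def toy_next(state):
    (pc,) = state
    position = ORDER.index(pc)
    successors = [state]  # stuttering step
    if position + 1 < len(ORDER):
        successors.append((ORDER[position + 1],))
    return successors


def no_decide(state):
    # Analogue of NoDecide: the process never reaches "AC".
    return state[0] != "AC"


violation = check_invariant(toy_init(), toy_next, no_decide)
```

In this toy system the invariant is violated, because the final step can always be taken; replacing `toy_next` with a transition function that only stutters makes the same check succeed.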


2.2 KerA+


TLA+ provides users with a myriad of ways of specifying systems. This richness,
although being one of its strengths, adds significant difficulty to the generation of
SMT constraints. To overcome this challenge, TLA+ specifications are rewritten
into a more compact language, KerA+, before being checked. From KerA+, the
ARS can generate SMT constraints in a simpler and provably sound way.

The KerA+ language consists of a small subset of TLA+ conjoined with
four additional constructs not originating from TLA+, and is able to express
almost all TLA+ expressions. It contains constructs for the manipulation of sets,
functions, records, tuples, and sequences, as well as integer arithmetic operators,
Boolean and integer literals, and constants, with all data structures having a
bounded size. The semantics of KerA+ derive directly from the TLA+ constructs
it uses, with the non-TLA+ based constructs, which help simplify the rewriting
system, having simple control semantics. The correctness of the rewriting itself
is guaranteed by construction. One example is the rewriting of S ∩ T into the
set comprehension {x ∈ S : x ∈ T}. Further KerA+ details are available in [13].


[Figure 3 graphic: three arenas over the cells c1 : Int, c2 : Int, c3 : Int,
c4 : Set[Int], c5 : Set[Int], and c6 : Set[Set[Int]]. Panels: (a) integers 5, 6,
and 7; (b) sets of integers {5, 6} and {6, 7}; (c) set of sets of integers
{{5, 6}, {6, 7}}.]

Fig. 3: Illustration of three arenas. The captions describe the modelled elements
with the overapproximation $c_1 = 5$, $c_2 = 6$, $c_3 = 7$, $c_4 = \{5, 6\}$,
$c_5 = \{6, 7\}$, and $c_6 = \{\{5, 6\}, \{6, 7\}\}$. Note that the concrete value
of a cell can be given by any of the possible subtrees having said cell as a root,
e.g., for $c_6$ we have that
$\exists\, c_4 \in P(\{5, 6\}), c_5 \in P(\{6, 7\}) \,.\, c_6 \in P(\{c_4, c_5\})$;
P stands for power set.
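
As a quick sanity check of this kind of rewriting, the comprehension {x ∈ S : x ∈ T} keeps exactly the elements of S that also belong to T. The Python sketch below only illustrates that equivalence on concrete sets; the function name `intersect_as_filter` is hypothetical and unrelated to the actual rewriting code.

```python
def intersect_as_filter(s, t):
    """The set comprehension {x in S : x in T}, written as a filter over S."""
    return {x for x in s if x in t}


S = {5, 6}
T = {6, 7}
result = intersect_as_filter(S, T)
```

Here `result` coincides with Python's built-in intersection `S & T`.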


2.3 Abstract Reduction System 


In order to verify a specification in KerA+ we generate an SMT formula that is
equisatisfiable to it. To do so, we use an abstract reduction system (ARS) which
iteratively applies reduction rules that transform KerA+ expressions into SMT
constraints. The core of the ARS is the arena, a graph structure that
overapproximates the specification’s data structures and guides rule application.
The rules collapse KerA+ expressions into cells, which represent the symbolic
evaluation of these expressions, with the cells then being used as vertices in the
arena. The arena edges represent the data structures overapproximation, e.g., a
cell representing a set will have directed edges to the cells representing all its
potential elements, as illustrated in Figure 3. The reduction process terminates
when the initial KerA+ expression e is collapsed into a single cell c, producing an
SMT formula Φ in the process, such that c ∧ Φ is equisatisfiable to e;
equisatisfiability relies on the boundedness of the data structures and is detailed
in Section 3.3. The satisfiability of e can then be checked by forwarding c ∧ Φ to
an SMT solver.

Formally, the ARS is defined as $(S, \leadsto)$, with $S$ being the set of ARS
states and ${\leadsto} \subseteq S \times S$ being the transition relation. A state
$\langle e, A, \nu, \Phi \rangle \in S$ is a four-tuple containing a KerA+
expression $e$, an arena $A$, a binding of names to cells $\nu$, and a first-order
formula $\Phi$. ARS states’ elements contain a number of cells, which are
first-order terms annotated with a type $\tau$. Cells of type Bool and Int
are interpreted in SMT as Booleans and integers, while cells of the remaining
types are encoded as uninterpreted constants in the constants encoding; the
arrays encoding approach is discussed in Section 3. Cells are referred to via the
notation $c_{name}$ or $c_{index}$, and they can be seen as both KerA+ constants
and first-order terms in SMT. An arena is a directed acyclic graph $A = (V, E)$,
with $V$ being a finite set of cells and $E \subseteq V \times (1..|V|) \times V$
being a set of relations between the cells in $V$. Every relation between cells is
represented by an arena edge of form $(c_a, i, c_b)$, also written
$c_a \xrightarrow{i} c_b$, with no duplicates, i.e., for every pair
$(c_{a_1}, i_1, c_{b_1}), (c_{a_2}, i_2, c_{b_2}) \in E$ we have that
$c_{a_1} = c_{a_2} \land c_{b_1} \neq c_{b_2}$ implies $i_1 \neq i_2$,
and no gaps in the relation indexes, i.e., for every edge $(c_a, i, c_b)$ and index
$j \in 1..(i - 1)$ we have that $\exists\, c_c \in V \,.\, (c_a, j, c_c) \in E$.
A binding is a partial function from KerA+ variables to $V$ of $A$, i.e., a
mapping from variables to cells. Finally, $\Phi$ is a formula in the SMT fragment
supported by the ARS and the target SMT solver, e.g., the quantifier-free
uninterpreted functions and non-linear arithmetics (QF_UFNIA) fragment supported
by the constants encoding.
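
The arena definition can be sketched as a small data structure. The Python below is an illustration under stated assumptions, not APALACHE's implementation: the class name `Arena`, its methods, and the cell names are hypothetical, the edges follow one reading of Figure 3 (c4 pointing to c1 and c2, c5 pointing to c2 and c3), and acyclicity is not checked.

```python
from dataclasses import dataclass, field


@dataclass
class Arena:
    """Cells plus indexed edges (source, index, target)."""
    cells: dict = field(default_factory=dict)  # cell name -> type string
    edges: set = field(default_factory=set)    # {(source, index, target)}

    def add_cell(self, name, cell_type):
        self.cells[name] = cell_type

    def add_edges(self, source, targets):
        # Number the new outgoing edges after the ones already present.
        existing = sum(1 for (a, _, _) in self.edges if a == source)
        for offset, target in enumerate(targets, start=1):
            self.edges.add((source, existing + offset, target))

    def well_formed(self):
        """Every edge links known cells; per source, indexes are exactly 1..k."""
        indexes = {}
        for (a, i, b) in self.edges:
            if a not in self.cells or b not in self.cells:
                return False
            indexes.setdefault(a, []).append(i)
        # Exactly 1..k per source means no duplicate index and no gap.
        return all(sorted(ix) == list(range(1, len(ix) + 1))
                   for ix in indexes.values())


arena = Arena()
for name in ("c1", "c2", "c3"):
    arena.add_cell(name, "Int")
arena.add_cell("c4", "Set[Int]")
arena.add_cell("c5", "Set[Int]")
arena.add_cell("c6", "Set[Set[Int]]")
arena.add_edges("c4", ["c1", "c2"])
arena.add_edges("c5", ["c2", "c3"])
arena.add_edges("c6", ["c4", "c5"])
```

Requiring the indexes of each source cell to be exactly 1..k enforces both the no-duplicates and the no-gaps conditions at once, while still allowing two edges of the same cell to point to the same target.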

A series of $n$ reduction steps has the form $s_0 \leadsto \dots \leadsto s_n$,
with each step generating state $s_{i+1}$ for state $s_i$, $0 \le i < n$, by
applying a reduction rule. The initial state
$s_0 = \langle e_0, A_0, \nu_0, \Phi_0 \rangle$ has $e_0$ as the initial KerA+
specification, $A_0 = (\emptyset, \emptyset)$, $\nu_0$ containing no mappings, and
$\Phi_0 = true$. The reduction steps end upon reaching a state
$s_n = \langle e_n, A_n, \nu_n, \Phi_n \rangle$, with $e_n$ being a single cell
$c \in V_n$, and $A_n = (V_n, E_n)$. Below we give two examples of rules.


Integer literal reduction. One of the simplest rules has an integer literal num
being rewritten into a cell $c_{num}$. This cell is added to the arena and a
constraint equating $c_{num}$ to the literal is conjoined with $\Phi$; we use
vertical lines to separate state elements and commas to indicate additions to $A$
and conjunctions to $\Phi$.


\[
\frac{\langle num : Int \mid A \mid \nu \mid \Phi \rangle \qquad
      num \text{ is one of } 0, 1, -1, \dots}
     {\langle c_{num} \mid A,\ c_{num} : Int \mid \nu \mid \Phi,\ c_{num} = num \rangle}
\quad (\textsc{Int})
\]


The descriptions of rules can be given as inferences, with the premisses above
the bar and the resulting state below it. Inferences, although reasonable to
express rules such as Int, are not suitable to give the intuition about how more
complex rules work. In light of this, we will use a simplified notation moving
forward. We inline inferences as $\Longrightarrow$ and omit nonessential
information, e.g., propagated values. Below we can see rule Int in this simplified
format. Note that only $A$ and $\Phi$ updates are shown, without propagating them,
and that $\nu$ is omitted.


\[
\frac{num : Int}{num \text{ is one of } 0, 1, -1, \dots}
\Longrightarrow c_{num} \mid c_{num} : Int \mid c_{num} = num
\quad (\textsc{Int})
\]


Picking. To pick a cell out of $n$ cells we use an oracle $\theta$, as per rule
FromBasic. In addition to the FROM ... BY $\theta$ expression, this rule requires
that all pickable cells are of the same basic type $\tau$, e.g., Int. The resulting
state has a new cell $c_{pick}$, which is equated to one of the $n$ cells if
$1 \le \theta \le n$ and is unconstrained otherwise. Picking among cells
representing data structures, e.g., sets, can be done via a more general version
of rule FromBasic, which we omit for brevity.


\[
\frac{\text{FROM } c_1, \dots, c_n \text{ BY } \theta : \tau}
     {\tau \text{ is basic and } c_1 : \tau, \dots, c_n : \tau}
\Longrightarrow c_{pick} \mid c_{pick} : \tau \mid
\bigwedge_{1 \le i \le n} (\theta = i \rightarrow c_{pick} = c_i)
\quad (\textsc{FromBasic})
\]
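
The constraints produced by picking can be sketched as one implication per pickable cell. The Python below is a hedged illustration, not the tool's code: `from_basic_constraints` emits SMT-LIB-like strings under the assumed names `theta` and `c_pick`, and `from_basic_holds` evaluates the same conjunction on concrete values.

```python
def from_basic_constraints(cell_names, oracle="theta", pick="c_pick"):
    """One implication per pickable cell: oracle = i  =>  pick = c_i."""
    return [f"(=> (= {oracle} {i}) (= {pick} {name}))"
            for i, name in enumerate(cell_names, start=1)]


def from_basic_holds(theta, pick_value, cell_values):
    """Evaluate the conjunction of implications on concrete values."""
    return all(theta != i or pick_value == value
               for i, value in enumerate(cell_values, start=1))


constraints = from_basic_constraints(["c1", "c2", "c3"])
```

With cell values 5, 6, and 7, an oracle value of 2 forces the picked cell to be 6, whereas an oracle value outside 1..3 leaves it unconstrained.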


2.4 SMT Theory of Arrays 


The theory of arrays provides a natural way to encode data structures and is thus
a prime candidate as an encoding target for TLA+ constructs. Here we present
the theory’s operators relevant for our work; further details can be found in [8].

Given the set of sorts $S$, containing one sort $s_\tau$ for each type $\tau$ in
KerA+, an array sort has the form $s_{\tau_1} \Rightarrow s_{\tau_2}$, with
$s_{\tau_1} \in S$ being its index sort and $s_{\tau_2} \in S$ being its value
sort. Each array sort is supported by two basic operators,
$select : (s_{\tau_1} \Rightarrow s_{\tau_2}, s_{\tau_1}) \to s_{\tau_2}$, which
handles array access at a given index, and
$store : (s_{\tau_1} \Rightarrow s_{\tau_2}, s_{\tau_1}, s_{\tau_2}) \to s_{\tau_1} \Rightarrow s_{\tau_2}$,
which updates an array for a given index and value. For brevity, we will write
$select(a, i)$ as $a[i]$ in the remainder of the manuscript. Regarding equality
between arrays, different interpretations are possible. We use arrays with
extensionality [25], which are considered equal if they contain the same values in
the same entries. Extensionality is formally defined as
$\forall\, a, b : s_{\tau_1} \Rightarrow s_{\tau_2} \,.\, a = b \lor \exists\, i : s_{\tau_1} \,.\, a[i] \neq b[i]$.
For access and update, consistency is ensured by the following property:


\[
\forall\, a : s_{\tau_1} \Rightarrow s_{\tau_2},\ i : s_{\tau_1},\ j : s_{\tau_1},\ v : s_{\tau_2} \,.\,
\underbrace{store(a, i, v)[i] = v}_{\text{access consistency}} \land
\underbrace{(i = j \lor store(a, i, v)[j] = a[j])}_{\text{update consistency}}
\]
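
The select and store operators, together with extensional equality, can be mimicked by a small immutable map. The Python sketch below is an assumption-laden model rather than an SMT solver interface: the class `Array` is hypothetical, and it represents an array as a default value plus finitely many overrides, which is only one particular kind of array.

```python
class Array:
    """Immutable total map: a default value plus finitely many overrides."""

    def __init__(self, default, overrides=None):
        self.default = default
        self.overrides = dict(overrides or {})

    def select(self, index):
        return self.overrides.get(index, self.default)

    def store(self, index, value):
        # Return a new array; the original is left untouched.
        updated = dict(self.overrides)
        updated[index] = value
        return Array(self.default, updated)

    def __eq__(self, other):
        # Extensional equality: the same value at every index
        # (assuming an infinite index sort, so the defaults must agree).
        if not isinstance(other, Array):
            return NotImplemented
        keys = set(self.overrides) | set(other.overrides)
        return (self.default == other.default
                and all(self.select(k) == other.select(k) for k in keys))


a = Array(0)
b = a.store(3, 7)
```

Here `b.select(3)` illustrates access consistency, `b.select(4) == a.select(4)` illustrates update consistency, and `a.store(3, 0) == a` shows that equality depends only on the stored values.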


In addition to select and store, the theory of arrays can be extended with
other operators, two of which are $map_f$ and $K_{s_\tau}$, whose signatures are
shown below. The $map_f$ operator applies an n-ary function
$f : (s_{\tau_1}, \dots, s_{\tau_n}) \to s_{\tau_r}$ to the values stored in each
index of its array arguments, producing a new array whose values are the result of
the function application, i.e., $map_f$ is the pointwise array extension of $f$.
The $K_{s_\tau}$ operator produces a constant array, with all its values being the
constant provided as argument. The properties defining the behaviour of these two
operators are shown after their signatures.


\[
map_f : (s_\tau \Rightarrow s_{\tau_1}, \dots, s_\tau \Rightarrow s_{\tau_n}) \to s_\tau \Rightarrow s_{\tau_r}
\qquad
K_{s_\tau} : s_{\tau_v} \to s_\tau \Rightarrow s_{\tau_v}
\]

\[
\forall\, a_1 : s_\tau \Rightarrow s_{\tau_1}, \dots, a_n : s_\tau \Rightarrow s_{\tau_n},\ i : s_\tau \,.\,
map_f(a_1, \dots, a_n)[i] = f(a_1[i], \dots, a_n[i])
\]
\[
\forall\, i : s_\tau,\ v : s_{\tau_v} \,.\, K_{s_\tau}(v)[i] = v
\]
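
The two defining properties can be illustrated by modelling arrays as Python functions from indexes to values. This is a sketch under that modelling assumption only; the helper names `const_array` and `map_array` are hypothetical and are not solver APIs.

```python
import operator


def const_array(value):
    """K(value): every index maps to the same value."""
    return lambda index: value


def map_array(f, *arrays):
    """map_f(a_1, ..., a_n): apply f pointwise to the arrays' values."""
    return lambda index: f(*(a(index) for a in arrays))


# Two Boolean-valued arrays over the integers, given by their contents.
evens = lambda i: i % 2 == 0
small = lambda i: abs(i) < 3

both = map_array(operator.and_, evens, small)  # pointwise conjunction
all_false = const_array(False)
```

Pointwise conjunction of two Boolean-valued arrays behaves like set intersection when the arrays are read as characteristic functions.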


The select and store operators are part of the theory of arrays with
extensionality defined in version 2.6 of the SMT-LIB standard [3]. Other operators
are provided on a solver-by-solver basis, e.g., Z3 [7] supports both $map_f$ and
$K_{s_\tau}$, while CVC5 [2] supports $K_{s_\tau}$; SMT-LIB updates may add them to
the standard.


3 Encoding TLA+ using Arrays


Our goal is to encode TLA+ data structures in a structure-preserving way. To do
this, we use arrays to represent the main components of TLA+, sets and
functions, as SMT constraints. We follow the ARS structure described in Section 2.3,
but update the reduction rules handling sets and functions. The remaining TLA+
constructs, e.g., tuples, are represented as per the constants encoding.

The two efficiency benefits of the arrays encoding are the ease of access 
of data structures and the possibility of using SMT equality. The first benefit 
can be easily understood by the use of SMT select, which allows us to check a 
stored value by using a single constraint, in contrast to the amount of constraints 
used in the constants encoding, which is linear in the size of data structures’ 
overapproximation. The second benefit affects the comparison of data structures, 
which can be done via a single SMT equality for sets and functions in the arrays 
encoding, since these structures are represented by a single SMT term, while the 
constants encoding requires a number of constraints that is quadratic in the size 
of data structures’ overapproximation. A summary can be seen in Table 1. We 
first describe how to encode sets and functions, and then present the correctness 
argument for the reduction to arrays. 


3.1 Encoding TLA+ Sets using Arrays


We use arrays to encode TLA+ sets as characteristic functions, i.e., a set of type
$\tau$ is represented by an array of sort $s_\tau \Rightarrow Bool$. Set membership
is encoded by storing true or false on a given array index. The reduction rules
used to handle the main set operators are presented below.


Set Enumeration. The simplest way to create a set is to enumerate its elements.
Rule Enum reduces an explicit set of cells to a fresh cell $c_{set}$, whose edges
link it to its elements; $c_{set} \to c_1, \dots, c_n$ is a shorthand for
$c_{set} \xrightarrow{1} c_1, \dots, c_{set} \xrightarrow{n} c_n$. There
is no guarantee that the enumerated elements are unique, thus the arena may
contain edges to repeated elements.


\[
\{c_1, \dots, c_n\} : Set[\tau] \Longrightarrow
c_{set} \mid c_{set} : Set[\tau],\ c_{set} \to c_1, \dots, c_n \mid EnumCtr
\quad (\textsc{Enum})
\]


The constraints EnumCtr added by the arrays encoding create an empty set,
by using a constant array with the value false, $\bot$, and update the array by
storing true, $\top$, on the appropriate indexes. The array resulting from the last
update, $a^n_{set}$, is then equated to $c_{set}$. Since cells representing
repeated elements lead to updates to the same index, we encode standard sets, in
contrast to the constants encoding, which encodes multisets due to the arena
imprecision; multisets lead to multiple constraints being generated to encode
membership of a single element.


\[
\underbrace{a^0_{set} = K_{s_\tau}(\bot)}_{\text{empty set}} \land
\underbrace{\bigwedge_{1 \le i \le n} a^i_{set} = store(a^{i-1}_{set}, c_i, \top)}_{\text{set updates}} \land
\underbrace{c_{set} = a^n_{set}}_{\text{cell equality}}
\quad (\textsc{EnumCtr})
\]


Although the number of constraints generated by the arrays encoding to
model set enumeration is equal to that of the constants encoding, it has the
benefit of generating a defined interpretation for $c_{set}$, the array
$a^n_{set}$, which is not present in the constants encoding. This has a significant
impact on set membership and cell equality, as described below.
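
The chain of constraints for set enumeration can be replayed on concrete values. The Python sketch below models arrays as functions and is only an illustration of the empty set, set updates, and cell equality steps; `const_array`, `store`, and `enum_set` are hypothetical helper names.

```python
def const_array(value):
    """K(value): every index maps to the same value."""
    return lambda index: value


def store(array, index, value):
    """A new array that agrees with `array` except at `index`."""
    return lambda i: value if i == index else array(i)


def enum_set(elements):
    """a^0 = K(false); a^i = store(a^(i-1), c_i, true); the result is a^n."""
    array = const_array(False)
    for element in elements:
        array = store(array, element, True)
    return array


# The element 6 is repeated, as may happen with repeated arena edges.
c_set = enum_set([5, 6, 6])
members = [x for x in range(10) if c_set(x)]
```

Membership of a value is a single lookup, `c_set(value)`, and storing true twice on the same index has no additional effect, so the repeated element yields a standard set.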


Set Membership. The checking of a membership relation $c_x \in c_{set}$,
given the presence of the arena edges $c_{set} \to c_1, \dots, c_n$ and
$1 \le x \le n$, is straightforward. A single fresh cell of Boolean type is
introduced and is equated to $c_{set}[c_x]$.


Construct           Arrays    Constants
Set enumeration     O(n)      O(n)
Set membership      O(1)      O(n)
Set equality        O(1)      O(n²)
Set filter          O(n)      O(n)
Set map             O(n)      O(n)
Fun. definition     O(n)      O(n)
Fun. domain         O(1)      O(n)
Fun. equality       O(1)      O(n²)
Fun. update         O(1)      O(n)
Fun. application    O(n)      O(n)

Table 1: Amount of constraints generated by each SMT encoding to model
the main TLA+ constructs.


Cell Equality. The constraints generated by encoding set membership and
many other constructs assume that cells can be compared. When this is not
directly the case the equalities are cached in preparation. For example, if a set of
$n$ tuples $c_i$ of size two is being equated, the constraints
$c_i = c_j \leftrightarrow c_i^1 = c_j^1 \land c_i^2 = c_j^2$, with
$1 \le i \le n$ and $1 \le j \le n$, are added to $\Phi$; here we use $c_i^1$ and
$c_i^2$ to represent the values of the 2-tuple. The need for this caching of
equalities only arises when data structures encoded as uninterpreted constants are
compared. For the remaining rules we assume that caching was done, if needed, and
cells can be compared via direct equality.

Set Filter. In TLA+, the elements of a set S can be filtered by a predicate p via
the expression {x ∈ S : p}. This expression will create a set F which contains
only the elements of S that satisfy p, e.g., {x ∈ {−1, 0, 1} : x ≥ 0} = {0, 1}.
Rule Filter reduces a filter to a new set cell, $c_F$, whose arena
overapproximation contains the elements of S, but whose constraints ensure that
only filtered elements are members of F; p[y/x] means that x is replaced by y in p
and parentheses indicate the application of another rule, the predicate resolution
rule in this case.


\[
\begin{array}{l}
\{x \in c_S : p\} : Set[\tau] \text{ and } c_S \to c_1, \dots, c_n \\
\Longrightarrow (\, p[c_1/x] : Bool, \dots, p[c_n/x] : Bool \Longrightarrow c^p_1, \dots, c^p_n \,) \\
\Longrightarrow c_F \mid c_F : Set[\tau],\ c_F \to c_1, \dots, c_n \mid FilterCtr
\end{array}
\quad (\textsc{Filter})
\]

The constraints added use an array $a^0_F$, initially unconstrained, i.e., the
values mapped by all the indexes of $a^0_F$ are unconstrained, as opposed to
$a^0_{set}$ in EnumCtr. The values of $a^0_F$ mapped by indexes $c_1, \dots, c_n$
are constrained by $c^p_1, \dots, c^p_n$ via array access, i.e., $a^0_F[c_i]$ is
asserted to be true or false based on $c^p_i$, with $1 \le i \le n$. We then apply
pointwise conjunction to $c_S$ and $a^0_F$ via the $map_\land$ SMT operator; we go
from $a^0_F$ to $a^n_F$ to keep the array index in step with the arena
overapproximation. Indexes whose values were false in S remain so in
F, and indexes whose values were true in S store the filter’s predicate evaluation.


\[
\underbrace{\bigwedge_{1 \le i \le n} ite(c^p_i,\ a^0_F[c_i],\ \neg a^0_F[c_i])}_{\text{predicate-based constraining}} \land
\underbrace{a^n_F = map_\land(c_S, a^0_F)}_{\text{pointwise conjunction}} \land
\underbrace{c_F = a^n_F}_{\text{cell equality}}
\quad (\textsc{FilterCtr})
\]

Both encodings generate a linear number of constraints, since n predicates
$p[c_i/x]$ have to be considered. Unlike with EnumCtr, FilterCtr does not contain
many store operations, due to the usage of $map_\land$. This avoids the need to
create intermediary arrays, and is not possible in EnumCtr due to its constant array.
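
The effect of the filter constraints can be sketched on concrete values by combining a predicate array with the source set through pointwise conjunction. The Python below is an illustration only: arrays are modelled as functions, `map_array` and `filter_set` are hypothetical names, and the entries that the constraints leave unconstrained are arbitrarily fixed to false, which is harmless here because the source set is false outside its arena candidates.

```python
import operator


def map_array(f, *arrays):
    """map_f(a_1, ..., a_n): apply f pointwise to the arrays' values."""
    return lambda index: f(*(a(index) for a in arrays))


def filter_set(c_s, candidates, predicate):
    """Return the array for {x in S : p}; `candidates` are S's arena cells."""
    # a^0: constrained only at the candidate indexes, by the predicate value.
    evaluations = {c: predicate(c) for c in candidates}
    a0 = lambda index: evaluations.get(index, False)
    # a^n = map_and(c_S, a^0)
    return map_array(operator.and_, c_s, a0)


source = {-1, 0, 1}
c_s = lambda i: i in source
c_f = filter_set(c_s, sorted(source), lambda x: x >= 0)
filtered = [x for x in range(-3, 4) if c_f(x)]
```

A single pointwise conjunction replaces the chain of intermediate arrays that a sequence of store operations would require.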


Set Map. The expression {e : x ∈ S} can be used to construct a set M from
a set S, having all the elements of M as e[y/x], with y ∈ S. For example, the
expression {x ÷ 5 : x ∈ {4, 5, 6}} yields the set {0, 1}, with ÷ denoting standard
integer division. To reduce set map we use rule Map.


\[
\begin{array}{l}
\{e : x \in c_S\} : Set[\tau] \text{ and } c_S \to c_1, \dots, c_n \\
\Longrightarrow (\, e[c_1/x] : \tau, \dots, e[c_n/x] : \tau \Longrightarrow c^e_1, \dots, c^e_n \,) \\
\Longrightarrow c_M \mid c_M : Set[\tau],\ c_M \to c^e_1, \dots, c^e_n \mid MapCtr
\end{array}
\quad (\textsc{Map})
\]

The constraints added in rule Map are similar to those added in rule Enum. The
difference between them is that set enumeration precisely defines the elements to
be added to the new set cell, while set map is based on an existing set cell, which
is a set overapproximation. Due to this, membership in M has to be guarded by
membership in S, leading to a linear number of constraints being generated.


\[
\underbrace{a^0_M = K_{s_\tau}(\bot)}_{\text{empty set}} \land
\underbrace{\bigwedge_{1 \le i \le n} ite\big(c_S[c_i],\ a^i_M = store(a^{i-1}_M, c^e_i, \top),\ a^i_M = a^{i-1}_M\big)}_{\text{set updates}} \land
\underbrace{c_M = a^n_M}_{\text{cell equality}}
\quad (\textsc{MapCtr})
\]
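
The guarded updates of set map can be replayed on concrete values. The Python below is a hedged sketch with arrays modelled as functions; `const_array`, `store`, and `map_set` are hypothetical names, and the extra candidate 10 stands for an arena cell that overapproximates the source set without being a member of it.

```python
def const_array(value):
    """K(value): every index maps to the same value."""
    return lambda index: value


def store(array, index, value):
    """A new array that agrees with `array` except at `index`."""
    return lambda i: value if i == index else array(i)


def map_set(c_s, candidates, expression):
    """Return the array for {e : x in S}; `candidates` are S's arena cells."""
    array = const_array(False)
    for c in candidates:
        if c_s(c):  # guard: only actual members of S contribute
            array = store(array, expression(c), True)
        # otherwise the array is carried over unchanged
    return array


c_s = lambda i: i in {4, 5, 6}
c_m = map_set(c_s, [4, 5, 6, 10], lambda x: x // 5)
mapped = [x for x in range(5) if c_m(x)]
```

Without the guard, the candidate 10 would wrongly add 2 to the mapped set.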


3.2 Encoding TLA+ Functions using Arrays


We use arrays to encode TLA+ functions directly as functions themselves. To do
this, arrays are used in their general format, with a function
$f : s_{\tau_1} \to s_{\tau_2}$ being encoded as an array of sort
$s_{\tau_1} \Rightarrow s_{\tau_2}$. Since functions with a finite domain can
rely on infinite sorts, e.g., the integer numbers, the encoding of each function
also includes constraints defining its domain set, by means of the rules described
in the previous section; the result of a function application to a value outside
its domain is undefined in TLA+. This approach allows us to generate SMT
constraints that follow directly from TLA+, making the encoding not only more
efficient, but also more natural to describe. In contrast, the constants encoding
represents functions explicitly as sets of pairs of form
{⟨x, f(x)⟩ : x ∈ DOMAIN f}. Due to this, its function manipulation relies on set
manipulation, e.g., function comparison is encoded as set comparison, leading to a
quadratic number of constraints. The reduction rules used to handle functions are
presented below.


Function Definition. The definition of a function in TLA+ is an expression of
the form [x ∈ S ↦ e], which maps every domain value v to the expression
e[v/x]. This definition is similar to that of set map {e : x ∈ S}, and thus
generates constraints in a similar fashion to rule Map. The main difference is
that the evaluations of the expression e[v/x] are stored as array values, rather
than array indexes, i.e., function definition uses store(a, v, e[v/x]) and set map
uses store(a, e[v/x], ⊤), with v being a value in the function’s domain or the
set being mapped. Every encoded function has a single argument, with multiple
arguments being rewritten as tuples in preprocessing.

Unlike with set cells, a function cell $c_F$ in the arena does not directly point
to its values, with the arrays encoding adding two edges to $c_F$,
$c_F \to c_{F_{dom}}$ and $c_F \to c_{F_{pairs}}$. Cell $c_{F_{dom}}$ represents
the function’s domain and cell $c_{F_{pairs}}$ represents the set of pairs
{⟨x, f(x)⟩ : x ∈ DOMAIN f}. Cell $c_{F_{pairs}}$, despite being in the arena,
has no SMT constraints modelling it in the arrays encoding, with its sole purpose
being to help propagate the arena edges of the function’s codomain elements.


Function Domain. Accessing a function’s domain is trivial in the arrays encod- 
ing, since the domain set is generated during function definition. This results in 
a simple access to the array representing the domain. 


Function Update. The update of a TLA+ function f is done by changing the
result of applying f to an argument arg, f[arg], to be a given value v, via the
expression [f EXCEPT ![arg] = v]. The update will produce a new function
g which is identical to f, except that g[arg] = v if arg ∈ DOMAIN f. The arrays
encoding generates a single array update constraint in this case.
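
The semantics of the update can be sketched by pairing an array with a domain set. The Python below illustrates only the TLA+-level behaviour described above and makes no claim about the exact constraint the tool emits: arrays and the domain are modelled as functions, the names `store` and `except_update` are hypothetical, and the explicit domain guard is an assumption of this sketch.

```python
def store(array, index, value):
    """A new array that agrees with `array` except at `index`."""
    return lambda i: value if i == index else array(i)


def except_update(function, domain, arg, value):
    """[f EXCEPT ![arg] = v] for a function given as (array, domain)."""
    # A single store when arg is in the domain; otherwise f is unchanged.
    return store(function, arg, value) if domain(arg) else function


# f = [x in {1, 2, 3} |-> x * x], with the domain kept as a separate set.
f = lambda x: x * x
dom = lambda x: x in {1, 2, 3}

g = except_update(f, dom, 2, 10)
h = except_update(f, dom, 7, 0)
```

The domain itself is untouched by the update, which is why only the value array needs a new constraint.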


Function Application. The application of a function to an argument $arg$ is
conceptually simple, but is quite intricate to realize, as can be seen in rule FunApp.
The arrays encoding uses an oracle to check that $c_{arg}$ is in the domain and to
gather the arena edges of $c_{res}$. The FunAppCtr constraints ensure that the oracle
chooses the correct index and equates the result cell to an array access on $c_F$.
Note that the value of $c_{res}$ comes directly from the function application
expression itself, with the oracle only being needed to gather the arena edges of $c_{res}$, if
$m > 0$, via $c^p$. The need for an oracle is restricted to functions whose codomain
contains structured data, e.g., $f : Int \to Set[Int]$. If this is not the case, e.g.,
$g : Int \to Int$, rule FunApp is simplified and FunAppCtr becomes $c_{res} = c_F[c_{arg}]$.


CF |Carg] : T and cr—cr,,,>c#,...,¢4 and CFC Fy, >C}, were 
z (FROM ci,...,c2 BY 8 : (Targ T) |60: Int] 0< <n = c) 
and c?[2]>c1,...,Cm 
— Cres | Cres : T,CresC1,---;Cm | Fun AppCtr 
(FunApp) 
. d d 
VAN A (ao d Ea Atre = CH [Caral (FunAppCtr) 


l<i<n 
—_—_—_—_ Ss * 


oracle constraining cell equality 
3.3 Correctness of the Reduction to Arrays


Correctness of the ARS is given by four properties: finiteness of the models, 
compliance to the target SMT theories, termination of any reduction sequence, 
and soundness of the reductions. These properties have their correctness sketched 
for the constants encoding in [13], with detailed proofs present in [26]. Since we 
rely on the existing ARS and restrict our changes to mainly affect constraint 
generation, we have the same degree of overapproximation and the correctness 
arguments made for the constants encoding are in large part valid for the arrays 
encoding. We present below the definition of a KerA+ model and detail, for each
property, how the use of arrays affects the correctness arguments and how they 
can be adjusted to remain valid. 


Models. Every satisfiable KerA+ formula has a model $\mathcal{M} = (\mathcal{D}, \mathcal{I})$, where $\mathcal{D}$
is the model domain, consisting of a disjoint union of sets $\mathcal{D}_1, \ldots, \mathcal{D}_n$, with $\mathcal{D}_i$,
$1 \le i \le n$, containing the values for type $\tau_i$, and $\mathcal{I}$ is the model interpretation,
consisting of assignments of domain values to KerA+ constants. Models are used
to access cell values, with the value of a KerA+ expression $e$ in model $\mathcal{M}$ being
$[\![e]\!]^{\mathcal{M}}$. In $s_{before} \leadsto s_{after}$, we go from $\mathcal{M}_{before}$ to $\mathcal{M}_{after}$, with $\mathcal{M}_{after}$ containing
the interpretation of additional constants and thus being an extension of $\mathcal{M}_{before}$.


Finiteness. This property states that every interpretation of a KerA+ expression
is defined only over finite values. Its proof is derived from the finiteness of the
elements being modelled. In the arrays encoding, we potentially use arrays with
infinite sorts, e.g., the integers, but all SMT interpretations that can be derived
from such arrays are finite, since we encode only finite TLA+ data structures.
This guarantees finiteness of all KerA+ models in the arrays encoding.


Theory Compliance. This property states that any sequence of states $s_0 \leadsto \ldots \leadsto s_n$
has the formulas $\Phi_i$, $1 \le i \le n$, in the first-order logic fragment containing only
quantifier-free expressions over uninterpreted functions and integer arithmetic.
Its proof is done by induction on the constraints generated. The constraint $\Phi_0$
is always true and is thus trivially compliant. The inductive case is proved by
showing that the constraints added by each rule are compliant. The rules in the
arrays encoding only add array constraints, in addition to constraints supported
by the constants encoding, so theory compliance is straightforward to guarantee.


Termination. This property states that every sequence of ARS reductions is
finite, i.e., the reduction process always terminates. Its proof is based on ensuring
that every rule $r$ applied to a given state $s_{before}$ yields a state $s_{after}$ with $e_{after}$
being smaller than $e_{before}$. An expression’s length is given based on the length
of its sub-expressions. The arrays encoding mainly changes constraint
generation, and in the cases where rules are slightly modified they generate resulting
expressions of the same size, thus guaranteeing termination.
Soundness. This property is described in Theorem 1. Both $e$ and $\Phi$ are KerA+
expressions, but $\Phi$ is in the first-order logic fragment supported by SMT solvers.
Fundamentally, the ARS is rewriting a formula to forward it to the solver. The
soundness proof consists of case analysis of each reduction rule to establish that
$e_{before} \land \Phi_{before}$ is equisatisfiable to $e_{after} \land \Phi_{after}$, no matter the rule applied
in $s_{before} \leadsto s_{after}$. The case analysis, which describes how $e_{after}$ and $\Phi_{after}$ can
be derived from $e_{before}$ and $\Phi_{before}$ for each rule, relies on six invariants of the
reduction system. Three invariants, 1, 3, and 4, are encoding independent, and
thus are the same as in [13]; the remaining three, 2, 5, and 6, are changed due to
the new representation of sets and functions. Below we show all six invariants,
with the modifications needed to guarantee soundness for the arrays encoding.


Theorem 1. Let $s_0 \leadsto \ldots \leadsto s_n$ be a sequence of states produced by the ARS, with
$s_i = \langle e_i \mid \mathcal{A}_i \mid \nu_i \mid \Phi_i \rangle$ and $1 \le i \le n$. Assume that $e_0$ is a formula, i.e., it has
type Bool. Then $e_0$ is satisfiable iff the conjunction $e_n \land \Phi_n$ is satisfiable.


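Invariant 1 (type correctness) In every reachable state $\langle e \mid \mathcal{A} \mid \nu \mid \Phi \rangle$ of the
ARS, the expression $e$ is well typed.


Invariant 2 (arena membership) In every reachable state $\langle e \mid \mathcal{A} \mid \nu \mid \Phi \rangle$ of
the ARS, every cell $c$ in either the expression $e$ or the formula $\Phi$ is also in $\mathcal{A}$.


Invariant 3 (model suitability) Let $s_{before} \leadsto s_{after}$ be a reachable transition
in the ARS, and $\mathcal{M}_{before}$ be a suitable model for $s_{before}$. Then a model
$\mathcal{M}_{after}$ that extends $\mathcal{M}_{before}$ is suitable for $s_{after}$.


Invariant 4 (overapproximation) Let $\langle e \mid \mathcal{A} \mid \nu \mid \Phi \rangle$ be a reachable state of
the ARS, and $\mathcal{M}$ be its model. Assume that $c_{set}$ is a set cell in the arena $\mathcal{A}$
and that $c_{set} \to c_1, \ldots, c_n$ are edges in $\mathcal{A}$, for some $n \ge 0$. Then, it holds that


$$[\![c_{set}]\!]^{\mathcal{M}} \subseteq \{[\![c_1]\!]^{\mathcal{M}}, \ldots, [\![c_n]\!]^{\mathcal{M}}\}.$$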


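Invariant 5 (function domain) Let $\langle e \mid \mathcal{A} \mid \nu \mid \Phi \rangle$ be a reachable state of the
ARS. Assume that $c_f$ is a function cell of type $\tau_1 \to \tau_2$ in the arena $\mathcal{A}$. Then,
there is a cell $c_{dom}$ of type $Set[\tau_1]$ such that $c_f \to c_{dom}$ is an edge in $\mathcal{A}$.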


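Invariant 6 (domain reduction) Let $\langle e \mid \mathcal{A} \mid \nu \mid \Phi \rangle$ be a reachable state
of the ARS, and $\mathcal{M}$ be its model. Assume that $c_F$ is a function cell and that
$c_F \to c_{F_{dom}}$ is in the arena $\mathcal{A}$. Then, it follows that $[\![c_{F_{dom}}]\!]^{\mathcal{M}} = [\![\text{DOMAIN } f]\!]^{\mathcal{M}}$.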


As described in Sections 3.1 and 3.2, arrays precisely model TLA+ sets and
functions. The handling of sets revolves around membership constraints of the form
$c_{set}[c_i]$, which can be set to true or false via store. Regarding functions,
function application and update are trivially equivalent to array access and update.
The more elaborate array operators also have a counterpart in TLA+. Constant
arrays are equivalent to a function definition for which all range values are the
same constant, and array map is equivalent to set map. These equivalences
explain how the changes in the arrays encoding do not invalidate the case analysis
of the reduction rules used to prove Theorem 1, thus guaranteeing soundness.
4 Evaluation


In order to evaluate the performance impact of the arrays-based encoding, we
implemented it in the APALACHE model checker, which currently supports the
constants encoding. Given a TLA+ specification containing a property P, APALACHE
is capable of performing bounded model checking up to a length k and, if P is
an inductive invariant, it can check if the property holds with an unbounded
length. In both modes, APALACHE checks if the SMT formula encoding the
specification is satisfiable when conjoined with ¬P, and if that is the case a
counterexample (CEX) in the form of a trace is produced using the arena
information and the satisfying assignment provided by the SMT solver. Our
implementation adds new reduction rules to APALACHE, which can be enabled via a
CLI flag. When enabled, these rules replace the existing ones encoding sets and
functions, as described in Section 3. In addition, we also extended APALACHE’s
CEX generation to handle assignments to SMT formulas containing arrays. We
use Z3 [7] as our back-end solver. APALACHE is open-source and freely available³.

We performed a number of experiments using APALACHE and the explicit-state
model checker TLC. For APALACHE, we evaluated both its existing
constants encoding and two versions of the arrays encoding we propose, called arrays
and funArrays. The arrays version encodes both TLA+ sets and functions as
arrays, while the funArrays version encodes only TLA+ functions as arrays. The
purpose of having two versions of our encoding is to evaluate the impact of
encoding sets and functions as arrays separately. Our evaluation setup consisted
of a machine with 64 AMD EPYC 7452 processors and 256 GB of memory. We
first present the benchmarks used and then discuss the results obtained.


4.1 Benchmarks 


We consider the TLA+ specifications of three asynchronous protocols as
benchmarks. The first benchmark is a specification of the asynchronous Byzantine
agreement protocol by Bracha and Toueg [5], shown in a simplified version in
Figure 2, to which we refer as aba. The second benchmark is a specification of the
consensus algorithm with Byzantine faults in one communication step by Dobre
and Suri [9], to which we refer as cab. The third benchmark is a specification of
the asynchronous non-blocking atomic commitment protocol by Guerraoui [12],
to which we refer as nac. The common use of aba and cab is in replication
scenarios with N = 3F + 1 replica nodes to tolerate F failures, while the nac protocol is
typically used for partitioned databases. The specifications are available online⁴.


4.2 Results 


For each specification we check a variation of the agreement property. The results
are shown in Figure 4. We can see that both arrays and funArrays scale in
performance better than the constants encoding, with an order of magnitude
improvement for some instances. It is also worth pointing out that arrays and
funArrays were able to reach a result before the time limit in 29 and 28 instances,
respectively, while the constants encoding was able to do so in only 20 instances.
With regard to TLC, it performed worse than the three APALACHE encodings in
the nontrivial cases, only reaching a result before the time limit in 8 instances.


3 Available at https://github.com/informalsystems/apalache
4 Available at https://github.com/informalsystems/apalache-bench


5 Related Work 


An extensive discussion of works related to symbolic model checking for TLA+
can be found in [13]. Here we focus exclusively on closely related publications.
The IVy Prover [20] was designed to tackle verification of distributed algorithms
with a decidable fragment of relational first-order logic. Some distributed
algorithms, such as the one in Figure 2, cannot be directly expressed in this fragment,
however, due to the use of power sets and set cardinalities. Recent efforts have
focused on offering support to reason about set cardinalities [4], but limitations
remain. Cut-off based techniques to automatically infer invariants of distributed
algorithms in the IVy language, such as relational abstractions of Paxos and
two-phase commit, have been recently proposed [10,11]. Similar benchmarks
are used in [22] to infer generalized invariants from finite instances of TLA+
and semi-automatically prove invariants with TLAPS. Specifications of
fault-tolerant distributed algorithms encoded as threshold automata can be efficiently
verified with ByMC [15,24]. The manual rewriting of an algorithm into
threshold automata is, however, usually beyond the skills of a typical TLA+ user. The
work closest to ours involves the use of SMT arrays to encode EventB and TLA+
specifications in ProB [21]. ProB focuses on handling infinite data
structures, in contrast to our choice to work with bounded overapproximations.
Reasoning about infinite domains implies the use of quantifiers, which prevents
the use of efficient decision procedures available for the decidable fragment of
SMT, and this approach has been shown to underperform when compared against
APALACHE in checking the benchmarks from [13]. An important last point to
mention is that CVC5 has its own non-standard SMT theory of sets [1]. This
theory, however, cannot currently handle nested sets, which is a very commonly
used TLA+ construct. It remains a viable alternative to the SMT theory of
arrays for the encoding of flat sets, but its use implies important restrictions
to the input language and, consequently, to practical application.


6 Conclusions 


We propose an encoding of the main TLA+ constructs into the SMT theory of
arrays, with the goal of providing the SMT solver with the structural information
it needs to efficiently reach a solution. We implemented our encoding into the
APALACHE model checker and our evaluation indicates that our arrays-based
encoding provides a significant performance improvement when compared against
APALACHE’s existing SMT encoding and the explicit-state model checker TLC.


[Figure 4: six plots, each showing time in seconds against instance size for
ARRAYS, FUNARRAYS, CONSTANTS, and TLC. Panels: (a) Results for aba OK.
(b) Results for aba NotOK. (c) Results for cab OK. (d) Results for cab NotOK.
(e) Results for nac OK. (f) Results for nac NotOK.]


Fig. 4: Time taken to check agreement for aba, cab, and nac. Specifications were run
in two configurations, one in which agreement is expected to hold (OK) and one
in which it is not (NotOK). Instance size stands for the number of nodes used,
and the time is given in seconds in logarithmic scale; Timeout (TO) is 1 hour.
Encoding the remaining TLA+ constructs in a structure-preserving way, be it
via SMT arrays or algebraic datatypes, remains an interesting research avenue.
Abstract. HyperLTL is a temporal logic that can express hyperproperties,
i.e., properties that relate multiple execution traces of a system.
Such properties are becoming increasingly important and naturally oc- 
cur, e.g., in information-flow control, robustness, mutation testing, path 
planning, and causality checking. Thus far, complete model checking 
tools for HyperLTL have been limited to alternation-free formulas, i.e., 
formulas that use only universal or only existential trace quantification. 
Properties involving quantifier alternations could only be handled in an 
incomplete way, i.e., the verification might fail even though the property 
holds. In this paper, we present AutoHyper, an explicit-state automata- 
based model checker that supports full HyperLTL and is complete for 
properties with arbitrary quantifier alternations. We show that language 
inclusion checks can be integrated into HyperLTL verification, which al- 
lows AutoHyper to benefit from a range of existing inclusion-checking 
tools. We evaluate AutoHyper on a broad set of benchmarks drawn from 
different areas in the literature and compare it with existing (incomplete) 
methods for HyperLTL verification. 


1 Introduction 


Hyperproperties [16] are system properties that relate multiple executions of 
a system. Such properties are of increasing importance as they naturally oc- 
cur, e.g., in information-flow control [36], robustness [22], linearizability [30,31], 
path planning [39], mutation testing [27], and causality checking [18]. A promi- 
nent logic to express hyperproperties is HyperLTL, which extends linear-time 
temporal logic (LTL) with explicit trace quantification [15]. HyperLTL can, for 
instance, express generalized non-interference (GNI) [34], stating that the high- 
security input of a system does not influence the observable output. 


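$$\forall \pi.\, \forall \pi'.\, \exists \pi''.\; \Box\Big(\bigwedge_{a \in H} a_\pi \leftrightarrow a_{\pi''}\Big) \land \Box\Big(\bigwedge_{a \in L \cup O} a_{\pi'} \leftrightarrow a_{\pi''}\Big) \tag{GNI}$$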


Here, $H$ is a set of high-security inputs, $L$ is a set of low-security inputs, and $O$ is
a set of low-security outputs. The formula states that for any traces $\pi, \pi'$ there
exists a third trace $\pi''$ that agrees with the high-security inputs of $\pi$ and with the
low-security inputs and outputs of $\pi'$. Any observation made by a low-security
attacker is thus compatible with every possible high-security input.
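
The quantifier structure of GNI can be made concrete with a small sketch. It is a simplification under stated assumptions: traces are finite tuples of sets of atomic propositions rather than infinite traces, and the function names `agree` and `satisfies_gni` are hypothetical. It is meant only to illustrate the ∀∀∃ pattern, not to serve as a model checking procedure.

```python
def agree(t1, t2, props):
    """True if t1 and t2 agree on every proposition in `props` at every position."""
    return len(t1) == len(t2) and all(
        (a in x) == (a in y) for x, y in zip(t1, t2) for a in props
    )


def satisfies_gni(traces, H, L, O):
    """For all t, t2 there is a t3 agreeing with t on H and with t2 on L and O."""
    low = set(L) | set(O)
    return all(
        any(agree(t, t3, H) and agree(t2, t3, low) for t3 in traces)
        for t in traces
        for t2 in traces
    )


def step(*props):
    """A one-step trace whose single letter contains the given propositions."""
    return (frozenset(props),)


# The output "o" is independent of the high input "h": GNI holds.
secure = [step(), step("h"), step("o"), step("h", "o")]
# The output copies the high input: GNI is violated.
leaky = [step(), step("h", "o")]
```

In the `leaky` example, no trace combines a high input of "h" with a low output without "o", so the existential quantifier cannot be resolved for that pair of traces.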


© The Author(s) 2023 
S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 145-163, 2023. 
https://doi.org/10.1007/978-3-031-30823-9_8
We are interested in the model checking (MC) problem of HyperLTL, i.e.,
whether a given (finite-state) system satisfies a given property. For HyperLTL, 
the structure of the quantifier prefix directly impacts the complexity of this 
problem. For alternation-free formulas (i.e., formulas that only use quantifiers of 
a single type), verification is well understood and is reducible to the verification 
of a trace property on a self-composition of the system [3]. This reduction has, 
for example, been implemented in MCHyper [29], a tool that can model check 
(alternation-free) HyperLTL formulas in systems of considerable size (circuits 
with thousands of latches). 

Verification is much more challenging for properties involving quantifier al- 
ternations (such as GNI from above). While MC algorithms supporting full 
HyperLTL exist (see [15,29]), they have not been implemented yet. Instead, 
over the years, a number of approaches to the verification of such properties in 
practice have been made: Finkbeiner et al. [29] and D’Argenio et al. [22] man- 
ually strengthen properties with quantifier alternation into properties that are 
alternation-free and can be checked by MCHyper. Coenen et al. [19] instantiate
existential quantification in a ∀*∃* property (i.e., a property involving an arbitrary
number of universal quantifiers followed by an arbitrary number of existential 
quantifiers, such as GNI) with an explicit (user-provided) strategy, thus reducing 
to the verification of an alternation-free formula. Alternatively, the strategy that 
resolves existential quantification can be automatically synthesized [7]. Hsu et 
al. [31] present a bounded model checking (BMC) approach for HyperLTL that 
is implemented in HyperQube. See Section 4 for more details. 

While all these verification tools can verify (or refute) interesting properties, 
they all suffer from the same fundamental limitation: they are incomplete. That 
is, for all the tools above, we can come up with verification instances where they 
fail, not because of resource constraints but because of inherent limitations in the 
underlying verification algorithm. Moreover, such instances are not rare events 
but are encountered regularly in practice. For example, many of the benchmarks 
used to evaluate HyperQube (by Hsu et al. [31]) do not admit a strategy to resolve 
existential quantification. Conversely, many of the properties verified by Coenen 
et al. [19] (such as GNI) cannot be verified using BMC [31]. 


AutoHyper. In this paper, we present AutoHyper, a model checker for Hyper- 
LTL. Our tool checks a hyperproperty by iteratively eliminating trace quantifi- 
cation using automata-complementations, thereby reducing verification to the 
emptiness check of an automaton [29]. Importantly — and different from previ- 
ous tools for HyperLTL verification such as MCHyper [29,19] and HyperQube [31] 
— AutoHyper can cope with (and is complete for) arbitrary HyperLTL formulas. 
Model checking using AutoHyper does not require manual effort (such as writing 
an explicit strategy in MCHyper [19]), nor does a user need to worry if the given 
property can even be verified with a given method. AutoHyper thus provides a 
“push-button” model checking experience for HyperLTL.¹


1 The name of AutoHyper is derived from the fact that it is both Automata-based 
and Automatic (i.e., it is complete and does not require any user intervention). 
To improve AutoHyper’s efficiency, we make the (theoretical) observation
that we can often avoid explicit automaton complementation and instead reduce
to a language inclusion check on Büchi automata (cf. Proposition 1). On the
practical side, this enables AutoHyper to resort to a range of mature language 
inclusion checkers, including spot [26], RABIT [17], BAIT [25], and FORKLIFT [24]. 


Evaluation. Using AutoHyper, we extensively study the practical aspects of 
model checking HyperLTL properties with quantifier alternations. To evalu- 
ate the performance of explicit-state model checking, we apply AutoHyper to 
a broad range of benchmarks taken from the literature and compare it with 
existing (incomplete) tools. We make the surprising observation that, at least
on the currently available benchmarks, explicit-state MC as implemented in
AutoHyper performs on par with (and frequently outperforms) symbolic methods such
as BMC [31]. Our benchmarks stem from various areas within computer science,
so AutoHyper should — thanks to its “push-button” functionality, completeness, 
and ease of use — be a valuable addition to many areas. 

Apart from using AutoHyper as a practical MC tool, we can also use it as 
a complete baseline to systematically evaluate existing (incomplete) methods. 
For example, while it is known that replacing existential quantification with a 
strategy (as done by Coenen et al. [19]) is incomplete, it was, thus far, unknown 
if this incompleteness occurs frequently or is merely a rare phenomenon. We use 
AutoHyper to obtain a ground truth and evaluate the strategy-based verification 
approach in terms of its effectiveness (i.e., how many instances it can verify 
despite being incomplete) and efficiency. 


Structure. The remainder of this paper is structured as follows. In Section 2, we 
introduce HyperLTL. We recap automata-based verification (which we abbrevi- 
ate ABV) and our new approach utilizing language inclusion checks in Section 3. 
We discuss alternative verification approaches for HyperLTL in Section 4. In
Section 6, we compare different backend solving techniques and study the complexity
of HyperLTL MC with multiple quantifier alternations in practice. In Section 7,
we evaluate ABV on a set of benchmarks from the literature and compare with
the bounded model checker HyperQube [31]. In Section 8, we use AutoHyper for
a detailed analysis of (and comparison with) strategy-based verification [19,7].


2 Preliminaries 


We fix a set of atomic propositions $AP$ and define $\Sigma := 2^{AP}$. HyperLTL [15]
extends LTL with explicit quantification over traces, thereby lifting it from a logic
expressing trace properties to one expressing hyperproperties [16]. Let $\mathcal{V}$ be a
set of trace variables. We define HyperLTL formulas by the following grammar:


$$\psi := a_\pi \mid \neg\psi \mid \psi \land \psi \mid \bigcirc\psi \mid \psi \,\mathcal{U}\, \psi$$
$$\varphi := \exists \pi.\, \varphi \mid \forall \pi.\, \varphi \mid \psi$$


where $\pi \in \mathcal{V}$ and $a \in AP$.
We assume that the formula is closed, i.e., all trace variables that are used
in the body are bound by some quantifier. The semantics of HyperLTL is given
with respect to a trace assignment $\Pi : \mathcal{V} \rightharpoonup \Sigma^\omega$ mapping trace variables to
traces. For $\pi \in \mathcal{V}$ and $t \in \Sigma^\omega$, we write $\Pi[\pi \mapsto t]$ for the trace assignment
obtained by updating the value of $\pi$ to $t$. Given a set of traces $T \subseteq \Sigma^\omega$, a trace
assignment $\Pi$, and $i \in \mathbb{N}$, we define:


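$$\begin{aligned}
\Pi, i &\models a_\pi && \text{iff } a \in \Pi(\pi)(i) \\
\Pi, i &\models \neg\psi && \text{iff } \Pi, i \not\models \psi \\
\Pi, i &\models \psi_1 \land \psi_2 && \text{iff } \Pi, i \models \psi_1 \text{ and } \Pi, i \models \psi_2 \\
\Pi, i &\models \bigcirc\psi && \text{iff } \Pi, i+1 \models \psi \\
\Pi, i &\models \psi_1 \,\mathcal{U}\, \psi_2 && \text{iff } \exists j \ge i.\ \Pi, j \models \psi_2 \text{ and } \forall i \le k < j.\ \Pi, k \models \psi_1 \\
\Pi &\models_T \psi && \text{iff } \Pi, 0 \models \psi \\
\Pi &\models_T \exists \pi.\, \varphi && \text{iff } \exists t \in T.\ \Pi[\pi \mapsto t] \models_T \varphi \\
\Pi &\models_T \forall \pi.\, \varphi && \text{iff } \forall t \in T.\ \Pi[\pi \mapsto t] \models_T \varphi
\end{aligned}$$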
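
The semantics can be turned into a small evaluator for a restricted setting. The sketch below assumes that the trace set is finite and that every trace is ultimately periodic, given as a pair `(prefix, loop)` standing for prefix · loop^ω with a non-empty loop, and that the formula has at least one quantifier. Formulas are nested tuples, and the encoding, the helper names, and the extra `"true"` operator are choices made for this illustration; none of this is AutoHyper's implementation.

```python
from functools import reduce
from math import gcd


def letter(trace, i):
    """Letter at position i of the lasso trace (prefix, loop)."""
    prefix, loop = trace
    return prefix[i] if i < len(prefix) else loop[(i - len(prefix)) % len(loop)]


def eval_body(psi, assignment):
    """Truth value of the LTL body psi at each position of a joint lasso."""
    p = max(len(pre) for pre, _ in assignment.values())
    period = reduce(lambda a, b: a * b // gcd(a, b),
                    (len(loop) for _, loop in assignment.values()), 1)
    n = p + period
    succ = lambda i: i + 1 if i + 1 < n else p  # position n repeats position p

    def ev(f):
        op = f[0]
        if op == "true":
            return [True] * n
        if op == "ap":      # ("ap", a, pi): a is in the letter of trace pi
            return [f[1] in letter(assignment[f[2]], i) for i in range(n)]
        if op == "not":
            return [not x for x in ev(f[1])]
        if op == "and":
            return [x and y for x, y in zip(ev(f[1]), ev(f[2]))]
        if op == "next":
            sub = ev(f[1])
            return [sub[succ(i)] for i in range(n)]
        if op == "until":   # least fixpoint of psi2 or (psi1 and next(until))
            left, right = ev(f[1]), ev(f[2])
            val = list(right)
            for _ in range(n):
                val = [right[i] or (left[i] and val[succ(i)]) for i in range(n)]
            return val
        raise ValueError(f"unknown operator {op!r}")

    return ev(psi)


def holds(quantifiers, psi, traces, assignment=None):
    """Check a formula Q1 pi1 ... Qk pik. psi over the trace set `traces`."""
    assignment = assignment or {}
    if not quantifiers:
        return eval_body(psi, assignment)[0]
    (q, pi), rest = quantifiers[0], quantifiers[1:]
    results = (holds(rest, psi, traces, {**assignment, pi: t}) for t in traces)
    return any(results) if q == "exists" else all(results)


t1 = ([], [{"a"}])        # a holds forever
t2 = ([{"a"}], [set()])   # a holds only at position 0
body = ("until", ("ap", "a", "x"), ("not", ("ap", "a", "y")))
```

With these two traces, the body `a_x U ¬a_y` holds under `forall x. exists y` (choosing `t2` for `y`), but not under `forall x. forall y`, since `t1` never falsifies `a`.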


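A transition system is a tuple $\mathcal{T} = (S, S_0, \kappa, L)$ where $S$ is a set of states,
$S_0 \subseteq S$ is a set of initial states, $\kappa \subseteq S \times S$ is a transition relation, and $L : S \to \Sigma$
is a labeling function. We write $s \xrightarrow{\mathcal{T}} s'$ whenever $(s, s') \in \kappa$. A path is an infinite
sequence $s_0 s_1 s_2 \cdots \in S^\omega$, s.t., $s_0 \in S_0$, and $s_i \xrightarrow{\mathcal{T}} s_{i+1}$ for all $i$. The associated
trace is given by $L(s_0)L(s_1)L(s_2)\cdots \in \Sigma^\omega$. We write $Traces(\mathcal{T}) \subseteq \Sigma^\omega$ for the
set of all traces generated by $\mathcal{T}$. We say $\mathcal{T}$ satisfies a HyperLTL property $\varphi$,
written $\mathcal{T} \models \varphi$, if $\emptyset \models_{Traces(\mathcal{T})} \varphi$, where $\emptyset$ denotes the empty trace assignment.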
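
The definition of a transition system and its traces can be sketched as follows. Since traces are infinite, the sketch only enumerates finite prefixes of a given length, and it assumes that every state has at least one successor so that each finite path extends to an infinite one. The class and function names are illustrative assumptions.

```python
from typing import NamedTuple


class TransitionSystem(NamedTuple):
    states: set
    initial: set
    transitions: set  # pairs (s, s2) of the transition relation
    label: dict       # maps each state to a frozenset of atomic propositions


def trace_prefixes(ts, k):
    """Return the set of length-k prefixes of the traces of `ts` (k >= 1)."""
    paths = [(s,) for s in ts.initial]
    for _ in range(k - 1):
        paths = [p + (t,) for p in paths for (s, t) in ts.transitions if s == p[-1]]
    return {tuple(ts.label[s] for s in p) for p in paths}


# Hypothetical example: state 0 may stay or move to state 1, which then loops.
ts = TransitionSystem(
    states={0, 1},
    initial={0},
    transitions={(0, 0), (0, 1), (1, 1)},
    label={0: frozenset(), 1: frozenset({"a"})},
)
prefixes = trace_prefixes(ts, 2)
```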


3 Automata-based HyperLTL Model Checking 


Given a system $\mathcal{T}$ and HyperLTL property $\varphi$, we want to decide whether $\mathcal{T} \models \varphi$.
In this section, we recap the automata-based approach to the model checking 
of HyperLTL [29]. We further show how language inclusion checks can be incor- 
porated into the model checking procedure to make use of a broad collection of 
mature language inclusion checkers. 


3.1 Automata-based Verification 


The idea of automata-based verification (ABV) [29] is to iteratively eliminate
quantifiers and thus reduce MC to the emptiness check on an automaton. A
non-deterministic Büchi automaton (NBA) is a tuple $\mathcal{A} = (Q, Q_0, \delta, F)$ where
$Q$ is a finite set of states, $Q_0 \subseteq Q$ is a set of initial states, $\delta : Q \times \Sigma \to 2^Q$ is
a transition function, and $F \subseteq Q$ is a set of accepting states. We write $\mathcal{L}(\mathcal{A}) \subseteq
\Sigma^\omega$ for the language of $\mathcal{A}$, i.e., all infinite words that have a run that visits
states in $F$ infinitely many times (see, e.g., [2]). For traces $t_1, \ldots, t_n \in \Sigma^\omega$, we
write $zip(t_1, \ldots, t_n) \in (\Sigma^n)^\omega$ as the pointwise product, i.e., $zip(t_1, \ldots, t_n)(i) :=
(t_1(i), \ldots, t_n(i))$.
Let $\mathcal{T} = (S, S_0, \kappa, L)$ be a fixed transition system and let $\dot\varphi$ be some fixed
closed HyperLTL formula (we use the dot to refer to the original formula and
use $\varphi, \varphi'$ to refer to subformulas of $\dot\varphi$). For some subformula $\varphi$ that contains
free trace variables $\pi_1, \ldots, \pi_n$, we say an NBA $\mathcal{A}$ over $\Sigma^n$ is $\mathcal{T}$-equivalent to $\varphi$,
if for all traces $t_1, \ldots, t_n$ it holds that $[\pi_1 \mapsto t_1, \ldots, \pi_n \mapsto t_n] \models_{Traces(\mathcal{T})} \varphi$ iff
$zip(t_1, \ldots, t_n) \in \mathcal{L}(\mathcal{A})$. That is, $\mathcal{A}$ accepts exactly the zippings of traces that
constitute a satisfying trace assignment for $\varphi$.

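To check if $\mathcal{T} \models \dot\varphi$, we inductively construct an automaton $\mathcal{A}_\varphi$ that is
$\mathcal{T}$-equivalent to $\varphi$ for each subformula $\varphi$ of $\dot\varphi$. For the (quantifier-free) LTL body
of $\dot\varphi$, we can construct this automaton via a standard LTL-to-NBA construction
[29,2]. Now consider some subformula $\varphi' = \exists \pi.\, \varphi$ where $\varphi'$ contains free trace
variables $\pi_1, \ldots, \pi_n$ and so $\varphi$ contains free trace variables $\pi_1, \ldots, \pi_n, \pi$. We are
given an inductively constructed NBA $\mathcal{A}_\varphi = (Q, Q_0, \delta, F)$ over $\Sigma^{n+1}$ that is
$\mathcal{T}$-equivalent to $\varphi$. We define the automaton $\mathcal{A}_{\varphi'}$ over $\Sigma^n$ as $\mathcal{A}_{\varphi'} := (S \times Q, S_0 \times
Q_0, \delta', S \times F)$ where $\delta'$ is defined as


$$\delta'\big((s, q), (l_1, \ldots, l_n)\big) := \big\{ (s', q') \mid s \xrightarrow{\mathcal{T}} s' \land q' \in \delta\big(q, (l_1, \ldots, l_n, L(s))\big) \big\}.$$


Informally, $\mathcal{A}_{\varphi'}$ reads the zippings of traces $t_1, \ldots, t_n$ and guesses a trace $t \in
Traces(\mathcal{T})$ such that $zip(t_1, \ldots, t_n, t) \in \mathcal{L}(\mathcal{A}_\varphi)$. It is easy to see that $\mathcal{A}_{\varphi'}$ is
$\mathcal{T}$-equivalent to $\varphi'$. To handle universal trace quantification, we consider a
formula $\varphi' = \forall \pi.\, \varphi$ as $\varphi' = \neg \exists \pi.\, \neg \varphi$ and combine the construction for existential
quantification with an automaton complementation.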
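
The product construction for existential quantification can be sketched directly from its definition. The sketch assumes an explicit representation in which the transition function of the NBA is a Python callable taking a state and a tuple of letters. The function name `project_exists` and the example automaton are hypothetical, and the code mirrors the definition rather than any optimized implementation.

```python
def project_exists(system, nba):
    """Build the automaton for (exists pi. phi) from an NBA for phi.

    `system` is (S, S0, kappa, L), with kappa a set of pairs and L a dict.
    `nba` is (Q, Q0, delta, F) over tuples of n+1 letters, where
    delta(q, letters) returns the set of successor states.
    """
    S, S0, kappa, L = system
    Q, Q0, delta, F = nba

    def delta_prime(state, letters):
        s, q = state
        # Guess a successor s2 of s and feed L(s) as the (n+1)-th letter.
        return {
            (s2, q2)
            for (s1, s2) in kappa
            if s1 == s
            for q2 in delta(q, tuple(letters) + (L[s],))
        }

    states = {(s, q) for s in S for q in Q}
    initial = {(s, q) for s in S0 for q in Q0}
    accepting = {(s, q) for s in S for q in F}
    return states, initial, delta_prime, accepting


# Hypothetical example: a two-state system and an NBA for "always a" (n = 0).
system = (
    {0, 1},
    {0, 1},
    {(0, 0), (0, 1), (1, 1)},
    {0: frozenset(), 1: frozenset({"a"})},
)
always_a = (
    {"q"},
    {"q"},
    lambda q, letters: {"q"} if "a" in letters[0] else set(),
    {"q"},
)
states, initial, delta_prime, accepting = project_exists(system, always_a)
```

In the example the resulting automaton reads the empty tuple as its only letter: from `(1, "q")` it can loop forever, whereas `(0, "q")` has no successor because state 0 is not labeled with `a`.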

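Following the inductive construction, we obtain an automaton $\mathcal{A}_{\dot\varphi}$ over the
singleton alphabet $\Sigma^0$ that is $\mathcal{T}$-equivalent to $\dot\varphi$. By definition of $\mathcal{T}$-equivalence,


$\mathcal{T} \models \dot\varphi$ iff $\emptyset \models_{Traces(\mathcal{T})} \dot\varphi$ iff $\mathcal{A}_{\dot\varphi}$ is non-empty (which we can decide [21]).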
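
Non-emptiness of a Büchi automaton amounts to finding an accepting state that is reachable from an initial state and lies on a cycle. The following sketch checks this with two plain reachability searches; it is a simple quadratic illustration with assumed names, whereas practical tools typically rely on nested depth-first search or SCC-based algorithms.

```python
def reachable(successors, sources):
    """All states reachable from `sources` (including the sources themselves)."""
    seen, stack = set(), list(sources)
    while stack:
        q = stack.pop()
        if q not in seen:
            seen.add(q)
            stack.extend(successors(q))
    return seen


def is_nonempty(initial, accepting, successors):
    """True iff some accepting state is reachable and can reach itself again."""
    for q in reachable(successors, initial):
        if q in accepting and q in reachable(successors, successors(q)):
            return True
    return False


# Hypothetical example graph: 0 -> 1, 1 -> 1, 2 -> 2 (state 2 is unreachable).
edges = {0: [1], 1: [1], 2: [2]}
successors = lambda q: edges[q]
```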


3.2 HyperLTL Model Checking by Language Inclusion 


The algorithm outlined above requires one complementation for each quantifier 
alternation in the HyperLTL formula. While we cannot avoid the theoretical 
cost of this complementation (see [36,15]), we can reduce to a, in practice, more 
tamable problem: language inclusion. 

For a system $\mathcal{T}$ and a natural number $n \in \mathbb{N}$, we define $\mathcal{A}_{\mathcal{T}}^n$ as an NBA over
$\Sigma^n$ such that for any traces $t_1, \ldots, t_n \in \Sigma^\omega$ we have $zip(t_1, \ldots, t_n) \in \mathcal{L}(\mathcal{A}_{\mathcal{T}}^n)$ if
and only if $t_i \in Traces(\mathcal{T})$ for every $1 \le i \le n$. We can construct $\mathcal{A}_{\mathcal{T}}^n$ by building
the n-fold self-composition of $\mathcal{T}$ [3] and converting it to an automaton by moving
the labels from states to edges and marking all states as accepting. We can now
state a formal connection between language inclusion and HyperLTL MC (a
proof can be found in the full version [9]):
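
The construction of the self-composition automaton can be sketched as follows, under the assumption of an explicit-state representation. States are n-tuples of system states, a letter is an n-tuple of labels, and a transition is enabled only if the letter matches the labels of the current states. The function name and the returned dictionary layout are choices made for this illustration.

```python
from itertools import product


def self_composition_nba(states, initial, transitions, label, n):
    """Sketch of the n-fold self-composition of a system, viewed as an NBA."""

    def delta(state, letters):
        # Labels move from states to edges: reading letter i requires that it
        # equals the label of the i-th component of the current state.
        if any(label[s] != a for s, a in zip(state, letters)):
            return set()
        succs = [[t for (u, t) in transitions if u == s] for s in state]
        return set(product(*succs))

    all_states = set(product(states, repeat=n))
    return {
        "initial": set(product(initial, repeat=n)),
        "accepting": all_states,  # every state is marked as accepting
        "delta": delta,
    }


# Hypothetical two-state system: 0 may stay or move to 1, which then loops.
empty, a = frozenset(), frozenset({"a"})
nba2 = self_composition_nba(
    {0, 1}, {0}, {(0, 0), (0, 1), (1, 1)}, {0: empty, 1: a}, 2
)
```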


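Proposition 1. Let $\dot\varphi = \forall \pi_1. \ldots \forall \pi_n.\, \varphi$ be a HyperLTL formula (where $\varphi$ may
contain additional trace quantifiers) and let $\mathcal{A}_\varphi$ be an automaton over $\Sigma^n$ that
is $\mathcal{T}$-equivalent to $\varphi$. Then $\mathcal{T} \models \dot\varphi$ if and only if $\mathcal{L}(\mathcal{A}_{\mathcal{T}}^n) \subseteq \mathcal{L}(\mathcal{A}_\varphi)$.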


We can use Proposition 1 to avoid a complementation for the outermost
quantifier alternation. For example, assume $\dot\varphi = \forall \pi_1. \forall \pi_2. \exists \pi_3.\, \psi$ where $\psi$ is
quantifier-free. Using the construction from Section 3.1, we obtain an automaton $\mathcal{A}_{\exists \pi_3. \psi}$
that is $\mathcal{T}$-equivalent to $\exists \pi_3.\, \psi$ (we can construct $\mathcal{A}_{\exists \pi_3. \psi}$ in linear time in the size
of $\mathcal{T}$). By Proposition 1, we then have $\mathcal{T} \models \dot\varphi$ iff $\mathcal{L}(\mathcal{A}_{\mathcal{T}}^2) \subseteq \mathcal{L}(\mathcal{A}_{\exists \pi_3. \psi})$.

Note that complementation and subsequent emptiness check is a theoreti- 
cally optimal method to solve the (PSPACE-complete) language inclusion prob- 
lem. Proposition 1 thus offers no asymptotic advantages over “standard” ABV 
in Section 3.1. In practice, however, constructing an explicit complemented automaton is
often unnecessary, as the language inclusion or non-inclusion might be witnessed
without a complete complementation [26,25,17,24]. This makes Proposition 1 
relevant for the present work and the performance of AutoHyper. 


4 Related Work and HyperLTL Verification Approaches 


HyperLTL [15] is the most studied logic for expressing hyperproperties. A range 
of problems from different areas in computer science can be expressed as
HyperLTL MC problems, including (optimal) path planning [39], mutation testing [27],
linearizability [31], robustness [22], information-flow control [36], and causality
checking [18], to name only a few. Consequently, any model checking tool for
HyperLTL is applicable to many disciplines within computer science and provides
a unified solution to many challenging algorithmic problems. In recent years, dif- 
ferent (mostly incomplete) methods for the verification of HyperLTL have been 
developed. We discuss them below (see the full version [9] for details). 


Automata-based Model Checking. Finkbeiner et al. [29] introduce the automata- 
based model checking approach as presented in Section 3.1. For alternation-free 
formulas, the algorithm corresponds to the construction of the self-composition
of a system [3] and is implemented in the MCHyper tool [29]. MCHyper can handle 
systems of significant size (well beyond the reach of explicit-state methods) but is 
unable to handle any quantifier alternation (the main motivation for AutoHyper). 
htltl2mc [15] is a prototype model checker for HyperLTL2 (a fragment of Hy- 
perLTL with at most one alternation) built on top of GOAL [38]. In contrast to 
htltl2mc, AutoHyper supports properties with arbitrarily many quantifier al- 
ternations and features automata with symbolic alphabets — which is important 
to handle large systems with many atomic propositions, cf. Footnote 7. 


Strategy-based Verification. Coenen et al. [19] verify ∀*∃* properties by
instantiating existential quantification with an explicit strategy. This method, which
we refer to as strategy-based verification (SBV), comes in two flavors: either the
strategy is provided by the user or the strategy is synthesized automatically. In 
the former case, model checking reduces to checking an alternation-free formula 
and can thus handle large systems, but requires significant user effort (and is 
thus no “push-button” technique). In the latter case, the method works fully au- 
tomatically [8,7] but requires an expensive strategy synthesis. SBV is incomplete 
as the strategy resolving existentially quantified traces only observes finite pre- 
fixes of the universally quantified traces. While SBV can be made complete by 
adding prophecy variables [7], the automatic synthesis of such prophecies is
currently limited to very small systems and properties that are temporally safe [5]. 
We investigate both the performance and incompleteness of SBV in Section 8. 


Bounded Model Checking. Hsu et al. [31] propose a bounded model checking 
(BMC) procedure for HyperLTL. Similar to BMC for trace properties [11], the 
system is unfolded up to a fixed depth, and pending obligations beyond that 
depth are either treated pessimistically (to show the satisfaction of a formula) 
or optimistically (to show the violation of a formula). While BMC for trace 
properties reduces to SAT-solving, BMC for hyperproperties naturally reduces to 
QBF-solving. As usual for bounded methods, BMC for HyperLTL is incomplete. 
For example, it can never show that a system satisfies a hyperproperty where 
the LTL body contains an invariant (as, e.g., is the case for GNI).² We compare 
AutoHyper and BMC (in the form of HyperQube [31]) in Section 7. 
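The effect of the two unrolling semantics can be seen already on a single trace. The sketch below evaluates an LTL formula in negation normal form on a finite prefix and treats every obligation that reaches beyond the prefix as false (pessimistic) or true (optimistic). The tuple-based formula encoding and the function name `holds` are assumptions for illustration; the actual QBF encoding of Hsu et al. [31] additionally handles trace quantifiers.

```python
def holds(formula, prefix, i, pessimistic):
    """Evaluate an LTL formula in negation normal form on a finite prefix.

    prefix      -- list of sets of atomic propositions (the unrolled steps)
    pessimistic -- obligations that reach beyond the prefix count as
                   False (pessimistic) or True (optimistic)
    """
    if i >= len(prefix):
        return not pessimistic
    op = formula[0]
    if op == "ap":                      # atomic proposition
        return formula[1] in prefix[i]
    if op == "nap":                     # negated atomic proposition
        return formula[1] not in prefix[i]
    if op == "and":
        return (holds(formula[1], prefix, i, pessimistic)
                and holds(formula[2], prefix, i, pessimistic))
    if op == "or":
        return (holds(formula[1], prefix, i, pessimistic)
                or holds(formula[2], prefix, i, pessimistic))
    if op == "X":
        return holds(formula[1], prefix, i + 1, pessimistic)
    if op == "G":                       # G f  ==  f and X G f
        return (holds(formula[1], prefix, i, pessimistic)
                and holds(formula, prefix, i + 1, pessimistic))
    if op == "F":                       # F f  ==  f or X F f
        return (holds(formula[1], prefix, i, pessimistic)
                or holds(formula, prefix, i + 1, pessimistic))
    raise ValueError(f"unknown operator: {op}")


prefix = [{"p"}, {"p"}, {"p"}]          # p holds in every unrolled step
invariant = ("G", ("ap", "p"))
print(holds(invariant, prefix, 0, pessimistic=True))   # False
print(holds(invariant, prefix, 0, pessimistic=False))  # True
```

No finite prefix makes the pessimistic evaluation of an invariant true, which mirrors the incompleteness described above.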


5 AutoHyper: Tool Overview 


AutoHyper is written in F# and implements the automata-based verification ap- 
proach described in Section 3.1 and, if desired by the user, makes use of the 
language-inclusion-based reduction from Section 3.2. AutoHyper uses spot [26] 
for LTL-to-NBA translations and automata complementations. To check lan- 
guage inclusion, AutoHyper uses spot (which is based on determinization), RABIT 
[17] (which is based on a Ramsey-based approach with heavy use of simulations), 
BAIT [25], and FORKLIFT [24] (both based on well-quasiorders). AutoHyper is 
designed such that communication with external automata tools is done via
established text-based formats (as opposed to proprietary APIs), namely the HANOI 
[1] and BA automaton formats. New (or updated) tools that improve on fun- 
damental automata operations, such as complementation and inclusion checks, 
can thus be integrated easily. Internally we represent automata using symbolic 
alphabets (similar to spot). We store transition formulas as DNFs as this allows 
for very efficient SAT checks, which are needed during the product construction. 
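Why DNFs make satisfiability checks cheap can be sketched as follows. The representation (a DNF as a list of cubes, each cube a dictionary from atomic propositions to truth values) and the function names are assumptions for illustration; the text does not describe AutoHyper's internal data structures at this level of detail.

```python
def conjoin(dnf_a, dnf_b):
    """Conjunction of two DNFs, each given as a list of cubes.

    A cube is a dict mapping an atomic proposition to True or False.
    Contradictory cube pairs are dropped while building the result.
    """
    result = []
    for cube_a in dnf_a:
        for cube_b in dnf_b:
            # Two cubes are compatible iff they agree on every shared proposition.
            if all(cube_b.get(ap, val) == val for ap, val in cube_a.items()):
                result.append({**cube_a, **cube_b})
    return result


def is_sat(dnf):
    """A DNF without contradictory cubes is satisfiable iff it has a cube."""
    return len(dnf) > 0


guard_1 = [{"a": True}, {"b": False}]      # a or (not b)
guard_2 = [{"a": False, "b": True}]        # (not a) and b
guard_3 = [{"b": True}]                    # b
print(is_sat(conjoin(guard_1, guard_2)))   # False: no joint transition
print(is_sat(conjoin(guard_1, guard_3)))   # True: cube a and b remains
```

During a product construction, two transitions can be combined only if the conjunction of their guards is satisfiable; with this representation the check reduces to a non-emptiness test.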

All experiments in this paper were conducted on a Mac Mini with an Intel 
Core i3 (i3-8100B) and 16GB of memory. We used spot version 2.11.1; RABIT 
version 2.4.5; BAIT commit 369e1a4; and FORKLIFT commit 5d519c3. 


Input Formats. AutoHyper supports both explicit-state systems (given in a 
HANOI-like [1] input format) and symbolic systems that are internally converted 
to an explicit-state representation. The support for symbolic systems includes 
Aiger circuits, symbolic models written in a fragment of the NuSMV input
language [13], and a simple boolean programming language [6]. 


2 BMC for trace properties can be made complete by using bounds on the unrolling 
depth (also called completeness thresholds) [14] and including loop conditions in the 
encoding [11]. As remarked by Hsu et al. [31], the same is much more challenging 
for hyperproperties, and no solutions have been proposed. Instead, Hsu et al. [31] 
propose an alternative unrolling semantics (which they call halting semantics) that 
can mitigate this incompleteness issue for programs that terminate after a fixed 
number of steps. This is a strong (and often unrealistic) assumption for general 
reactive systems. 


Random Benchmarks. For our evaluation, we use both existing instances from 
various sources in the literature and randomly generated problems.³ We generate 
random transition systems based on the Erdős–Rényi–Gilbert model [28]. Given 
a size n and a density parameter p ∈ [0,1], we generate a graph with n states, 
where for every two states s, s', there is a transition s → s' with probability p. To 
generate a graph with n states and, in expectation, constant outdegree of k, we 
can choose p = k/n. We further ensure that the system is connected and all states 
have at least one outgoing edge. We generate random HyperLTL formulas (with 
a given quantifier prefix) by sampling the LTL matrix using spot’s randltl. 
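The generation of random transition systems can be sketched as follows. The choice p = k/n follows the description above; how deadlocks are repaired and how connectivity is enforced is not specified in the text, so the two repair steps below (and the reading of "connected" as "reachable from state 0") are assumptions, as is the function name `random_system`.

```python
import random


def random_system(n, k, seed=None):
    """Sample a transition system with n states and expected outdegree k."""
    rng = random.Random(seed)
    p = k / n
    # Each of the n * n possible transitions is present with probability p.
    succ = {s: {t for t in range(n) if rng.random() < p} for s in range(n)}
    for s in range(n):
        # Assumption: repair deadlocks by adding one random successor.
        if not succ[s]:
            succ[s].add(rng.randrange(n))
    # Assumption: make every state reachable from state 0 by linking an
    # already reached state to each state that is still unreachable.
    reached, todo = {0}, [0]
    for target in range(n):
        while todo:
            for t in succ[todo.pop()]:
                if t not in reached:
                    reached.add(t)
                    todo.append(t)
        if target not in reached:
            succ[rng.choice(sorted(reached))].add(target)
            reached.add(target)
            todo.append(target)
    return succ


system = random_system(50, 5, seed=1)
average_outdegree = sum(len(ts) for ts in system.values()) / len(system)
print(average_outdegree)  # close to k = 5 in expectation
```

The repair steps add a few extra edges, so the average outdegree can be slightly above k for small or very sparse systems.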


6 HyperLTL Model Checking Complexity in Practice 


Before we turn our attention to benchmarks found in the literature, we compare 
the different backend inclusion checkers supported by AutoHyper by evaluating 
them on a large set of synthetic (random) benchmarks (in Section 6.1). More- 
over, the random generation of benchmarks allows us to peek at formulas with 
more than one quantifier alternation. The theoretical hardness of model check- 
ing properties with multiple alternations has been studied extensively [15,36], 
and we analyze, for the first time, how these results transfer to practice (in 
Section 6.2). 


6.1 Performance of Inclusion Checkers 


As the first set of benchmarks, we compare the different backend inclusion check- 
ers supported by AutoHyper. In Figure 1, we depict how many instances can be 
solved using the inclusion checks of spot, BAIT, RABIT, and FORKLIFT within 
a timeout of 10s and give the median running time used on the instances that 
could be solved within the timeout. We observe that spot clearly outperforms 
RABIT, BAIT, and FORKLIFT in terms of the percentage of instances that can be 
checked within 10s. While, in general, spot solves the most instances, a manual 
inspection reveals that there are also instances that can only be solved by RABIT 
or BAIT/FORKLIFT. This justifies why AutoHyper supports multiple backend
inclusion checkers that implement different algorithms and thus excel on different 
problems (we will confirm this in Section 7). Moreover, our experiments
provide evidence that HyperLTL MC is a natural source for challenging language 
inclusion benchmarks (see the full version [9]). 


3 The advantage of randomly generated instances is twofold. First, it allows for the 
easy generation of a large set of benchmarks. Second, the random generation is 
parameterized by multiple parameters (such as system size, transition density,
formula size, etc.), enabling a comprehensive analysis of the exact impact of different 
parameters on the model checking complexity in practice. 

4 We remark that spot operates on automata with a symbolic alphabet (i.e.,
transitions are defined as boolean formulas over AP). In contrast, RABIT, BAIT, and 
FORKLIFT only support explicit alphabets (i.e., automata with one symbol for each 
element in 2^AP). 


Fig. 1: We evaluate different backend solvers on instances of varying system size 
with an (on average) constant outdegree of 10 and a fixed property size of 20. 
We generate 20 samples per system size. We display both the success rate of 
each solver within a timeout of 10s (on the left axis) and the median running 
time on the solved instances (on the right axis). 
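The two quantities reported for each solver in Figure 1 (success rate within the timeout and median running time over the solved instances) can be computed as in the following sketch. The input format (a list of running times with `None` for unsolved instances) and the function name `summarize` are assumptions for illustration.

```python
from statistics import median


def summarize(runtimes, timeout=10.0):
    """Return (success rate, median running time over solved instances).

    runtimes -- running times in seconds, None for instances without result
    """
    solved = [t for t in runtimes if t is not None and t <= timeout]
    rate = len(solved) / len(runtimes)
    return rate, (median(solved) if solved else None)


print(summarize([0.5, 2.0, None, 12.0]))  # (0.5, 1.25)
```

Taking the median only over solved instances means that a solver with a low success rate can still show a low median, so the two numbers should be read together.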

We remark that we set the timeout of 10s deliberately low to compute (and
reproduce) the plots in a reasonable time (computing Figure 1 took about 3.5h). If 
a user wants to verify a given instance and does not require a result within a few
seconds, running the solver for even longer will likely increase the success rate
further (see also the evaluation in Section 7). 


6.2 Model Checking Beyond ∀*∃* 


Using randomly generated benchmarks, we can also peek at the practical
complexity of model checking in the presence of multiple quantifier alternations. In
theory, the model checking complexity of HyperLTL increases by one exponent with 
each quantifier alternation [15,36]. Using AutoHyper, we can, for the first time,
investigate the model checking complexity in practice. 


[Figure 2: stacked bar chart. x-axis: Quantifier Alternations (1 to 4); y-axis: Time in ms; legend: 1st, 2nd, 3rd, and 4th Complementation.] 

Fig. 2: For properties with a varying number of quantifier alternations, 
we display the average time spent on the automata complementation
during model checking. 


Table 1: We depict the running time of AutoHyper when verifying GNI on the 
boolean programs taken from [6] and [10]. We give the program, the bitwidth 
(bw), the size of the intermediate explicit-state representation (Size), and the 
time taken by each solver. The timeout is set to 60s and indicated by a “-”. The 
property holds in all cases. Times are given in seconds. 

Program  bw     Size  t_spot  t_RABIT  t_BAIT  t_FORKLIFT 
[6].1    1-bit    17  0.52    0.59     0.80    0.61 
         ?         ?  ?       ?        ?       ? 
[6].2    ?-bit   129  0.99    5.51     -       - 
         1-bit    55  0.53    0.69     -       5.49 
[6].3    1-bit    20  0.52    0.61     3.05    0.98 
         3-bit    80  0.61    1.31     -       ? 
[6].4    1-bit    29  0.52    0.56     0.58    0.57 
         3-bit   113  0.67    1.74     -       - 
[10].1   1-bit     5  0.52    0.56     0.58    0.57 
         1-bit    11  0.51    0.57     0.72    0.61 
[10].2   2-bit    27  0.52    0.65     35.7    5.43 
         ?-bit   291  1.46    -        -       - 
[10].3   1-bit    21  0.52    0.60     3.15    1.00 
         3-bit   225  -       45.2     -       - 
[10].4   1-bit    25  0.52    0.71     12.8    1.63 
         3-bit   193  0.98    -        -       - 


We model check randomly generated formulas with 1 to 4 quantifier
alternations and visualize the total running time based on the cost of each
complementation (using spot) in Figure 2 (recall that checking a formula with k alternations 
using ABV requires k automaton complementations). Although the number of 
quantifier alternations has an undeniable impact on the total running time (the 
cumulative height of each bar), the increase in runtime is not proportional to the 
(non-elementary) increase suggested by the theoretical analysis. Different from 
the theoretical analysis (where the (k + 1)th complementation is exponentially 
more expensive than the kth), the cost of each complementation barely increases 
(or even decreases). This suggests that the T-equivalent automata constructed 
in each iteration are, in practice, much smaller than indicated by the worst-case 
theoretical analysis. Verification of properties beyond one alternation is thus less 
infeasible than the theory suggests (at least on randomly generated test cases). 
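For reference, the worst-case growth that the measurements are compared against can be written as a tower of exponentials. The following is a rough sketch under the standard assumption that complementing a nondeterministic Büchi automaton with n states may yield 2^{O(n log n)} states; the symbols n_0 and n_j are introduced here for illustration only, and the polynomial cost of the intermediate product constructions is ignored.

```latex
% n_0: size of the automaton before the first complementation (assumed symbol)
% n_j: size of the automaton after the j-th complementation (assumed symbol)
\[
  n_{j+1} \;\le\; 2^{O(n_j \log n_j)}
  \qquad\Longrightarrow\qquad
  n_k \;\le\; \underbrace{2^{2^{\cdots^{2^{O(n_0 \log n_0)}}}}}_{k\ \mathrm{exponentials}}
\]
```

Under this bound, each additional alternation would add one level to the tower, which is the growth that the per-complementation costs in Figure 2 do not exhibit.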


7 Evaluation on Symbolic Systems 


In this section, we challenge AutoHyper with complex model checking prob- 
lems found in the literature. Our benchmarks stem from a range of sources, 
including non-interference in boolean programs [6], symmetry in mutual exclu- 
sion algorithms [19], non-interference in multi-threaded programs [37], fairness 
in non-repudiation protocols [32], mutation testing [27], and path planning [39]. 


7.1 Model Checking GNI on Boolean Programs 


We use AutoHyper to verify GNI on a range of boolean programs that process 
high-security and low-security inputs (taken from [6,10]). Table 1 depicts the 
runtime results using different backend solvers. We test each program with
varying bitwidth and depict the largest bitwidth that can be solved by at least one 
solver (within a timeout of 60s). We, again, note that spot performs better than 
other inclusion checkers and, in particular, scales better when the size of the
system increases. Note that the number of atomic propositions is 3 in all instances, 
so spot’s support for symbolic alphabets has a negligible impact on the running 
time. We emphasize that not all instances in Table 1 can be verified using SBV 
[19,7] without a user-provided fixed lookahead. Likewise, BMC [31] can never 
verify GNI. This provides further evidence why complete model checking tools 
(of which AutoHyper is the first) are necessary. 


Table 2: We evaluate HyperQube and AutoHyper on the benchmarks from [31]. 
We list the system and the property (as given in [31, Table 2]), the quantifier 
structure (Q*), the verification result (Res) (✓ indicates that the property holds 
and ✗ that it is violated), and the total running time of either tool (t). For 
HyperQube, we additionally list the unrolling bound (k) and the unrolling
semantics (Sem). For AutoHyper, we additionally list the size of the intermediate 
explicit state space (Size). Times are given in seconds. 

                                           HyperQube [31]       AutoHyper 
System              Spec    Q*  Res    k   Sem   t        Size  t 
Bakery3             φ_S1    ?   ✗      7   pes   1.9      167   2.3 
Bakery3             φ_S2    ?   ✗      12  pes   2.0      167   4.2 
Bakery3             φ_S3    ?   ✗      20  pes   2.8      167   34.6 
Bakery3             φ_sym1  ?   ✗      10  pes   1.7      167   16.2 
Bakery3             φ_sym2  ?   ✗      10  pes   1.6      167   2.9 
Bakery5             φ_sym1  ?   ✗      10  pes   17.3     996   282.1 
Bakery5             φ_sym2  ?   ✗      10  pes   18.2     996   18.0 
SNARK-bug1          φ_lin   ?   ✗      26  hpes  618.0    4941  96.1 
3-Thread_correct    φ_NI    ?   ✓      10  hopt  1.6      64    1.3 
3-Thread_incorrect  φ_NI    ?   ✗      57  hpes  12.8     368   7.7 
NRP: T_correct      φ_fair  ?   ✓      15  hopt  1.3      55    0.5 
NRP: T_incorrect    φ_fair  ?   ✓      15  hopt  1.4      54    0.8 
Mutant              φ_mut   ?   ✓      8   hopt  1.1      32    0.8 


7.2 Explicit Model Checking of Symbolic Systems 


In this section, we evaluate AutoHyper on challenging symbolic models (NuSMV 
models [13]) that were used by Hsu et al. [31] to evaluate HyperQube. 

The properties we verify cover a wide range. For example, we 
verify that Lamport’s bakery algorithm [33] does not satisfy various symmetry 
properties (as the algorithm prioritizes processes with a lower ticket ID); we 
check linearizability⁵ [30] on the SNARK datastructure [23] and identify a
previously known bug; and we generate model-based mutation test cases using the 
approach proposed by Fellner et al. [27]. Further details on the benchmarks are 
provided in [31]. 

We check each instance using both HyperQube and AutoHyper and depict 
the results in Table 2.⁶ When using AutoHyper we always apply spot’s
inclusion checker.⁷ For HyperQube we use the unrolling semantics and unrolling depth 
listed in [31, Table 2]. We observe that for most instances, despite using
explicit-state methods and thus being complete (cf. Section 7.4), AutoHyper performs 
on par with HyperQube. On instances using Lamport’s bakery algorithm, BMC 
only needs to unroll to very shallow depths, resulting in very efficient solving, 
whereas AutoHyper’s running time is dominated by spot’s LTL-to-NBA transla- 
tion (consuming up to 98% of the total time). Conversely, on the large SNARK 
example, AutoHyper performs significantly better. 


7.3 Hyperproperties for Path Planning 


As a last set of benchmarks, we use planning problems for robots encoded into 
HyperLTL as proposed by Wang et al. [39]. For example, the synthesis of a 
shortest path can be phrased as an ∃∀ property that states that there exists a 
path to the goal such that all alternative paths to the goal take at least as long. 
Wang et al. [39] propose a solution to check the resulting HyperLTL property 
by encoding it in first-order logic, which is then solved by an SMT solver. While 
not competitive with state-of-the-art planning tools, HyperLTL allows one to 
express a broad range of problems (shortest path, path robustness, etc.) in a 
very general way. Hsu et al. [31] observe that the QBF encoding implemented 
in HyperQube outperforms the SMT-based approach by Wang et al. [39]. In this 
section, we evaluate AutoHyper on these planning-hyperproperties and compare 
it with HyperQube.⁸ 
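As a rough illustration of the ∃∀ shape described above, a shortest-path requirement can be sketched as the following formula. The atomic proposition goal and the exact form of the formula are assumptions made here for illustration; the precise encoding is the one given by Wang et al. [39].

```latex
% goal_pi holds at a step iff trace pi is at the goal in that step (assumed proposition)
\[
  \varphi_{\mathit{sp}} \;=\; \exists \pi.\, \forall \pi'.\;
    (\neg \mathit{goal}_{\pi'}) \;\mathcal{U}\; \mathit{goal}_{\pi}
\]
```

Read informally: the path π reaches the goal, and no alternative path π' reaches the goal strictly earlier than π does.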

We depict the results in Table 3. It is evident that AutoHyper outperforms 
HyperQube, sometimes by orders of magnitude. This is surprising as planning 
problems (which are essentially reachability problems) on symbolic systems should 
be advantageous for symbolic methods such as BMC. The large size of the
intermediate QBF indicates that a more optimized encoding (perhaps specific to 
path planning) could improve the performance of BMC on such examples. 


5 Linearizability asserts that any execution of a concurrent data structure corresponds 
to a sequential execution, which is naturally expressed as a ∀∃ hyperproperty. 

6 For the two verification instances (Bakery3, φ_S3) and (NRP: T_incorrect, φ_fair), 
HyperQube provides the wrong verification result. We mark such instances with a 
“I” to avoid confusion when comparing Table 2 with [31, Table 2]. In particular, the 
supposedly unfair version of the NRP protocol is, in fact, fair. 

7 The automata use a symbolic alphabet with up to 18 letters. A conversion to an 
explicit alphabet (as required for RABIT, BAIT, and FORKLIFT) is thus infeasible 
(this would require 2^18 symbols). 

8 AutoHyper is intended as a model checking tool, i.e., it only checks if a property 
holds or is violated. However, as we show in the full version [9], we could use the 
counterexamples returned by the inclusion checker to synthesize an actual plan. 


Table 3: We evaluate HyperQube and AutoHyper on hyperproperties that encode 
the existence of a shortest path (φ_sp) and robust path (φ_rp). We give the
specification (Spec), the size of the grid (Grid), and the times taken by HyperQube and 
AutoHyper (t). For HyperQube, we additionally give the unrolling depth used 
(k) and the file size of the QBF generated (|QBF|). For AutoHyper, we
additionally give the size of the generated explicit state space (Size). Times are given in 
seconds. The timeout is set to 20 min and indicated by a “-”. 

                   HyperQube [31]          AutoHyper 
Spec   Grid        k    |QBF|    t         Size  t 
φ_sp   10 x 10     20   8 MB     4.6       146   0.7 
       20 x 20     40   26 MB    168.1     188   1.5 
       40 x 40     80   -        -         408   22.7 
       60 x 60     120  -        -         404   88.8 
φ_rp   10 x 10     20   13 MB    4.2       266   0.6 
       20 x 20     40   84 MB    22.4      572   0.7 
       40 x 40     80   419 MB   265.0     1212  1.6 
       60 x 60     120  -        -         1852  3.7 


7.4 Bounded vs. Explicit-State Model Checking 


Bounded model checking has seen remarkable success in the verification of trace 
properties and frequently scales to systems whose size is well out of scope for 
explicit-state methods [20]. Similarly, in the context of alternation-free hyper- 
properties, symbolic verification tools such as MCHyper [29] (which internally 
reduces to the verification of a circuit using ABC [12]) can verify systems that 
are well beyond the reach of explicit-state methods. In contrast, in the context 
of model checking for hyperproperties that involve quantifier alternations, our 
findings make a strong case for the use of explicit-state methods (as implemented 
in AutoHyper): 

First, compared to symbolic methods (such as BMC), explicit-state model 
checking is currently the only method that is complete. While BMC was able to 
verify or refute all properties in Tables 2 and 3, many instances cannot be solved 
with the current BMC encoding. As a concrete example, BMC can never verify 
formulas whose body contains simple invariants (such as GNI) and can thus not 
verify any of the instances in Table 1. The most significant advantage of explicit- 
state MC (as implemented in AutoHyper) is thus that it is both push-button and 
complete, i.e., it can — at least in theory — verify or refute all properties. 
Second, the performance of AutoHyper seems to be on par with that of BMC, 
and AutoHyper frequently outperforms it (even by several orders of magnitude, cf. Table 3). 
We stress that this is despite the fact that for the evaluation of HyperQube we 
already fix an unrolling depth and unrolling semantics, thus creating favorable 
conditions for HyperQube.⁹ While BMC for trace properties reduces to SAT
solving, BMC of hyperproperties reduces to QBF solving, a problem that is much 
harder and has seen less support by industry-strength tools. It is, therefore,
unclear whether the advance of modern QBF solvers can improve the performance 
of hyperproperty BMC to the same degree that the advance of SAT solvers 
has stimulated the success of BMC for trace properties. Our findings 
indicate that, at the moment, QBF solving (often) seems inferior to an explicit 
(automata-based) solving strategy. 
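To make the remark about favorable conditions concrete, the sketch below contrasts a single query at a known depth with a classical BMC loop that does not know the depth or the verdict in advance. The two oracle arguments are hypothetical stand-ins for QBF queries (for instance, wrappers around a bounded model checker); they are not part of HyperQube's interface.

```python
def bmc_loop(check_violation, check_satisfaction, max_depth):
    """Classical BMC loop with incrementally increasing unrolling depth.

    check_violation, check_satisfaction -- hypothetical oracles that take an
    unrolling depth and return True if the respective verdict is established.
    Returns (verdict, depth, number of oracle queries).
    """
    queries = 0
    for depth in range(1, max_depth + 1):
        queries += 1
        if check_violation(depth):
            return "violated", depth, queries
        queries += 1
        if check_satisfaction(depth):
            return "satisfied", depth, queries
    return "unknown", max_depth, queries


# Satisfaction first becomes provable at depth 3: six queries instead of one.
print(bmc_loop(lambda d: False, lambda d: d >= 3, 10))  # ('satisfied', 3, 6)
```

With least bound k, the loop issues 2k - 1 or 2k queries, and it does not terminate with a verdict at all when no such bound exists.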


8 Evaluating Strategy-based Verification 


So far, we have used AutoHyper to check hyperproperties on instances arising in 
the literature. In this last section, we demonstrate that AutoHyper also serves 
as a valuable baseline to evaluate different (possibly incomplete) verification 
methods. Here we focus on strategy-based verification (SBV), i.e., the idea of 
automatically synthesizing a strategy that resolves existential quantification in 
∀*∃* HyperLTL properties [19,7]. 


8.1 Effectiveness of Strategy-based Verification 


SBV is known to be incomplete [19,7]. However, due to the previous lack of 
complete tools for verifying ∀*∃* properties, a detailed study into how effective 
SBV is in practice was impossible on a larger scale (i.e., beyond hand-crafted 
examples). With AutoHyper, we can, for the first time, rigorously evaluate SBV. 
We use the SBV implementation from [7], which synthesizes a strategy for the 
∃-player by translating the formula to a deterministic parity automaton (DPA) 
[35] and phrases the synthesis as a parity game. 

We have generated random transition systems and properties of varying sizes 
and computed a ground truth using AutoHyper. We then performed SBV (recall 
that SBV can never show that a property does not hold and might fail to estab- 
lish that it does). We find that for our generated instances, the property holds 
in 61.1% of the cases, and SBV can verify the property in 60.4% of the cases. 
Successful verification with SBV is thus possible in many cases, even without 
the addition of expensive mechanisms such as prophecies [7]. On the other hand, 
our results show that random generation produces instances (albeit not many) 
on which SBV fails (so far, examples where SBV fails required careful
construction by hand). Reverting to SBV as the default verification strategy is thus not 
possible, further strengthening the case for complete model checking tools (of 
which AutoHyper is the first). 


9 In Tables 2 and 3, we perform a single query with a fixed unrolling depth k and 
semantics, i.e., we already know if we want to show satisfaction or violation and the 
depth needed to show this (as done in [31]). In a classical BMC loop, we would check 
for satisfaction and violation with an incrementally increasing unrolling depth and 
thus perform roughly 2k many QBF queries where k is the least bound for which 
satisfaction or violation can be established (if this bound even exists). 
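The two percentages reported in Section 8.1 can be related directly. The calculation below assumes that both figures refer to the same set of generated instances and that SBV only succeeds on instances where the property holds.

```latex
% 61.1%: instances where the property holds; 60.4%: instances verified by SBV
\[
  \frac{60.4}{61.1} \approx 0.989,
  \qquad
  61.1\% - 60.4\% = 0.7\%
\]
```

Under these assumptions, SBV verifies roughly 98.9% of the instances on which the property holds and fails on about 0.7% of all generated instances.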


8.2 Efficiency of Strategy-based Verification 


After having analyzed the effectiveness of SBV (i.e., how many instances can be 
verified), we turn our attention to the efficiency of SBV. In theory,
(automata-based) model checking of ∀*∃* HyperLTL, as implemented in AutoHyper, is 
EXPSPACE-complete in the specification and PSPACE-complete in the size of the 
system [15,36]. Conversely, SBV is 2-EXPTIME-complete in the size of the
specification but only PTIME in the size of the system [19]. Consequently, one would 
expect that ABV fares better on larger specifications and SBV fares better on 
larger systems (the more important measure in practice). 

However, in this section, we show that this does not translate into practice (at 
least using the current implementation of SBV [7]). We compare the running time 
of AutoHyper (ABV) (using spot’s inclusion checker) and SBV. We break the 
running time into the three main steps for each method. For ABV, this is the 
LTL-to-NBA translation, the construction of the product automaton, and the 
inclusion check. For SBV, it is the LTL-to-DPA translation, the construction of 
the game, and the game-solving. 

We depict the average cost for varying system sizes in Figure 3. We observe 
that SBV performs worse than ABV and, more importantly, scales poorly in the 
size of the system. This is contrary to the theoretical analysis of ABV and SBV. 
As the detailed breakdown of the running time suggests, the poor performance 
is due to the costly construction of the game and the time taken to solve the 
game. An almost identical picture emerges if we compare ABV and SBV relative 
to the property size (we give a plot in the full version [9]). While, in this case, the 
results match the theory (i.e., SBV scales worse in the size of the specification), 
we find that the bottleneck for SBV is not the LTL-to-DPA translation (which, in 
theory, is exponentially more expensive than the LTL-to-NBA translation used 
in ABV), but, again, the construction and solving of the parity game. 


[Figure 3: running-time breakdown over System Size; recoverable legend entries: Inclusion Check, Product Construction, LTL to NBA.] 

Fig. 3: We compare ABV (AutoHyper) and SBV ([7]) on instances of varying 
system size. We fix the property size to 20. We generate 100 random instances for 
each size and take the average over the fastest L instances, where L is the minimum 
number of instances solved within a 5s timeout by both methods. 

We remark that the SBV engine we used [7] is not optimized and always 
constructs the full (reachable) game graph. The poor performance of SBV can 
be attributed to the fact that the size of the game does, in the worst case, 
scale quadratically in the size of the system (when considering ∀¹∃¹ properties). 
This is amplified in dense systems (i.e., systems with many transitions), as, with 
increasing transition density, the size of the parity games approaches its worst- 
case size (see the full version [9]). In contrast, the heavily optimized inclusion 
checker (in this case spot) seems to be able to check inclusion in almost constant 
time (despite being exponential in theory). This efficiency of mature language 
inclusion checkers is what enables AutoHyper to achieve remarkable performance 
that often exceeds that of symbolic methods such as BMC (cf. Section 7) and 
further strengthens the practical impact of Proposition 1. 
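The quadratic worst case and the influence of transition density can be reproduced with a small experiment. The sketch below counts the pairs of system states that are reachable when a universal and an existential copy of the system move in lockstep. It deliberately ignores the automaton component and the intermediate positions of the actual parity game, so it is a simplification made for illustration and not the construction used in [7].

```python
def reachable_pairs(succ, init=0):
    """Pairs (s, t) reachable when both copies start in `init` and each
    copy takes one transition per round."""
    seen, todo = {(init, init)}, [(init, init)]
    while todo:
        s, t = todo.pop()
        for s2 in succ[s]:          # move of the universal copy
            for t2 in succ[t]:      # response of the existential copy
                if (s2, t2) not in seen:
                    seen.add((s2, t2))
                    todo.append((s2, t2))
    return seen


n = 20
cycle = {s: {(s + 1) % n} for s in range(n)}   # sparse: outdegree 1
dense = {s: set(range(n)) for s in range(n)}   # dense: complete graph
print(len(reachable_pairs(cycle)), len(reachable_pairs(dense)))  # prints: 20 400
```

The sparse system yields only n reachable pairs, whereas the dense system reaches all n² pairs, which matches the observation that dense systems push the game towards its worst-case size.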


9 Conclusion 


In this paper, we have presented AutoHyper, the first complete model checker 
for HyperLTL with an arbitrary quantifier prefix. We have demonstrated that 
AutoHyper can check many interesting properties involving quantifier alterna- 
tions and often outperforms symbolic methods such as BMC, sometimes by 
orders of magnitude. We believe the biggest advantage of AutoHyper to be its 
push-button functionality combined with its completeness: As a user, one does 
not need to worry whether AutoHyper is applicable to a particular property (dif- 
ferent from, e.g., SBV or BMC) and does not need to provide hints (e.g., in the 
form of explicit strategies in SBV). 

Apart from evaluating AutoHyper’s performance on a range of benchmarks, 
we have used AutoHyper to (1) compare various backend language inclusion 
checkers, (2) explore practical verification beyond one quantifier alternation 
(which is not as infeasible as suggested by the theory), and (3) rigorously eval- 
uate the effectiveness and efficiency of strategy-based verification in practice 
(which, different than suggested by a theoretical analysis, performs worse than 
automata-based methods). 
Abstract. Given a machine learning (ML) model and a prediction,
explanations can be defined as sets of features which are sufficient for the 
prediction. In some applications, and besides asking for an explanation, 
it is also critical to understand whether sensitive features can occur in 
some explanation, or whether a non-interesting feature must occur in all 
explanations. This paper starts by relating such queries respectively with 
the problems of relevancy and necessity in logic-based abduction. The 
paper then proves membership and hardness results for several families 
of ML classifiers. Afterwards the paper proposes concrete algorithms for 
two classes of classifiers. The experimental results confirm the scalability 
of the proposed algorithms. 


Keywords: Formal Explainability · Abduction · Abstraction Refinement. 


1 Introduction 


The remarkable achievements in machine learning (ML) in recent years [12,32,47] 
are not matched by a comparable degree of trust. The most promising ML models 
are inscrutable in their operation. As a direct consequence, the opacity of ML 
models raises distrust in their use and deployment. Motivated by a critical need 
for helping human decision makers to grasp the decisions made by ML models, 
there has been extensive work on explainable AI (XAI). Well-known examples 
include so-called model agnostic explainers or alternatives based on saliency 
maps for neural networks [9,50, 58,59]. While most XAI approaches do not offer 
guarantees of rigor, and so can produce explanations that are unsound given 
the underlying ML model, there have been efforts on developing rigorous XAI 
approaches over the last few years [40, 54,63]. Rigorous explainability involves 
the computation of explanations, but also the ability to answer a wide range of 
related queries [7, 8, 36]. 

By building on the relationship between explainability and logic-based ab- 
duction [25,30,40,61], this paper analyzes two concrete queries, namely feature 
necessity and relevancy. Given an ML classifier, an instance (i.e. point in feature 
space and associated prediction) and a target feature, the goal of feature
necessity is to decide whether the target feature occurs in all explanations of the given 
instance. Under the same assumptions, the goal of feature relevancy is to decide 
whether a feature occurs in some explanation of the given instance. This paper 
proves a number of complexity results regarding feature necessity and relevancy, 
focusing on well-known families of classifiers, some of which are widely used in 
ML. Moreover, the paper proposes novel algorithms for deciding relevancy for 
two families of classifiers. The experimental results demonstrate the scalability 
of the proposed algorithms. 
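The two queries can be made concrete with a brute-force sketch for tiny classifiers over Boolean features. It assumes that an explanation is a subset-minimal set of features that is sufficient for the prediction (without minimality, every feature would trivially occur in some sufficient set); all function names are hypothetical, and features are numbered from 0 here. The enumeration is exponential and only meant to illustrate the definitions, not the algorithms proposed in the paper.

```python
from itertools import combinations, product


def is_sufficient(kappa, v, subset):
    """True iff fixing the features in `subset` to their values in v
    forces the prediction kappa(v), whatever the other features are."""
    free = [i for i in range(len(v)) if i not in subset]
    for values in product([0, 1], repeat=len(free)):
        x = list(v)
        for i, val in zip(free, values):
            x[i] = val
        if kappa(tuple(x)) != kappa(v):
            return False
    return True


def explanations(kappa, v):
    """All subset-minimal sufficient feature sets, by increasing size."""
    found = []
    for size in range(len(v) + 1):
        for subset in combinations(range(len(v)), size):
            s = set(subset)
            # Skip supersets of explanations that were already found.
            if not any(e <= s for e in found) and is_sufficient(kappa, v, s):
                found.append(s)
    return found


def is_necessary(kappa, v, feature):
    """Feature necessity: the feature occurs in all explanations."""
    return all(feature in e for e in explanations(kappa, v))


def is_relevant(kappa, v, feature):
    """Feature relevancy: the feature occurs in some explanation."""
    return any(feature in e for e in explanations(kappa, v))


def kappa(x):
    """Hypothetical classifier over four Boolean features; feature 3 is ignored."""
    return int(x[0] and (x[1] or x[2]))


v = (1, 1, 1, 0)
print(explanations(kappa, v))
print(is_necessary(kappa, v, 0), is_relevant(kappa, v, 1), is_relevant(kappa, v, 3))
```

In this example feature 0 is necessary, features 1 and 2 are relevant but not necessary, and feature 3 is irrelevant.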

The paper is organized as follows. The notation and definitions used through- 
out are presented in Section 2. The problems of feature necessity and relevancy 
are studied in Section 3, and example algorithms are proposed in Section 4. 
Section 5 presents experimental results for a sample of families of classifiers, 
Section 6 relates our contribution with earlier work and Section 7 concludes the 
paper. 


2 Preliminaries 


Complexity classes, propositional logic & quantification. The paper 
assumes basic knowledge of computational complexity, namely the classes of 
decision problems P, NP and $\Sigma_2^{\mathrm{P}}$ [6]. The paper also assumes basic knowledge 
of propositional logic, including the Boolean satisfiability (SAT) problem for 
propositional logic formulas in conjunctive normal form (CNF), and the use of 
SAT solvers as oracles for the complexity class NP. The interested reader is 
referred to textbooks on these topics [6, 13]. 


2.1 Classification Problems 


Throughout the paper, we will consider classifiers as the underlying ML model. 
Classification problems are defined on a set of features (or attributes) $\mathcal{F} = 
\{1, \ldots, m\}$ and a set of classes $\mathcal{K} = \{c_1, c_2, \ldots, c_K\}$. Each feature $i \in \mathcal{F}$ takes 
values from a domain $D_i$. Domains are categorical or ordinal, and each domain 
can be defined on boolean, integer/discrete or real values. Feature space is defined 
as $\mathbb{F} = D_1 \times D_2 \times \ldots \times D_m$. The notation $\mathbf{x} = (x_1, \ldots, x_m)$ denotes an arbitrary 
point in feature space, where each $x_i$ is a variable taking values from $D_i$. The set 
of variables associated with the features is $X = \{x_1, \ldots, x_m\}$. Also the notation 
$\mathbf{v} = (v_1, \ldots, v_m)$ represents a specific point in feature space, where each $v_i$ is a 
constant representing one concrete value from $D_i$. A classifier C is characterized 
by a (non-constant) classification function $\kappa$ that maps feature space $\mathbb{F}$ into the 
set of classes $\mathcal{K}$, i.e. $\kappa : \mathbb{F} \to \mathcal{K}$. An instance denotes a pair $(\mathbf{v}, c)$, where $\mathbf{v} \in \mathbb{F}$ 
and $c \in \mathcal{K}$, with $c = \kappa(\mathbf{v})$. 


2.2 Examples of Classifiers


The results presented in the paper apply to a comprehensive range of widely used 
classifiers [28,62]. These include decision trees (DTs) [18,42], decision graphs
(DGs) [44] and diagrams (DDs) [1, 68], decision lists (DLs) [38, 60] and sets 
(DSs) [19,41], tree ensembles (TEs) [37], including random forests (RFs) [17,43] 
and boosted trees (BTs) [29], neural networks (NNs) [56], naive bayes classifiers 
(NBCs) [45,52], classifiers represented with propositional languages, including 
deterministic decomposable negation normal form (d-DNNFs) [23,35] and its 
proper subsets, e.g. sentential decision diagrams (SDDs) [22,66] and free binary 
decision diagrams (FBDDs) [23,31,68], and also monotonic classifiers. In the rest 
of the paper, we will analyze some families of classifiers in more detail. 

d-DNNF classifiers. Negation normal form (NNF) is a well-known propositional
language, where the negation operators are restricted to atoms, or inputs.
Any propositional formula can be converted to NNF in polynomial time. Let the
support of a node be the set of atoms associated with leaves reachable from
the outgoing edges of the node. Decomposable NNF (DNNF) is a restriction of
NNF where the children of AND nodes do not share atoms in their support.
A DNNF circuit is deterministic (referred to as d-DNNF) if any two children
of OR nodes cannot both take value 1 for any assignment to the inputs.
Restrictions of NNF including DNNF and d-DNNF exhibit important tractability
properties [23]. We also briefly introduce FBDDs, which are a proper subset
of d-DNNFs. An FBDD over a set $X$ of Boolean variables is a rooted, directed
acyclic graph comprising two types of nodes: nonterminal and terminal. A
nonterminal node is labeled by a variable $x_i \in X$, and has two outgoing
edges, one labeled by 0 and the other by 1. A terminal node is labeled by a 1
or 0, and has no outgoing edges. The subgraph rooted at a node labeled with a
variable $x_i$ represents a boolean function $f$ which is defined by the
Shannon expansion:
$f = (x_i \land f|_{x_i=1}) \lor (\neg x_i \land f|_{x_i=0})$, where
$f|_{x_i=1}$ ($f|_{x_i=0}$) denotes the cofactor [16] of $f$ with respect to
$x_i = 1$ ($x_i = 0$). Moreover, any FBDD is read-once, meaning that each
variable is tested at most once on any path from the root node to a terminal
node.
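
The FBDD definition above can be illustrated with a minimal Python sketch. The
graph below is a hypothetical FBDD (it is not the circuit of Fig. 1) for the
function $(x_1 \land (x_2 \lor x_4)) \lor (\neg x_1 \land x_3 \land x_4)$; the
dictionary layout and the names `FBDD`, `evaluate` and `is_read_once` are
assumptions made purely for illustration.

```python
from itertools import product

# Hypothetical FBDD for (x1 AND (x2 OR x4)) OR (NOT x1 AND x3 AND x4).
# Each nonterminal maps to (variable index, 0-child, 1-child); 0 and 1 are terminals.
FBDD = {
    "a": (1, "b", "c"),
    "b": (3, 0, "d"),
    "c": (2, "d", 1),
    "d": (4, 0, 1),
}
ROOT = "a"


def evaluate(fbdd, root, point):
    """Follow the edge selected by each tested variable until a terminal is reached."""
    node = root
    while node not in (0, 1):
        var, low, high = fbdd[node]
        node = high if point[var - 1] else low
    return node


def is_read_once(fbdd, node, seen=frozenset()):
    """Check that no root-to-terminal path tests the same variable twice."""
    if node in (0, 1):
        return True
    var, low, high = fbdd[node]
    if var in seen:
        return False
    seen = seen | {var}
    return is_read_once(fbdd, low, seen) and is_read_once(fbdd, high, seen)


def formula(x):
    """Reference semantics used to cross-check the graph."""
    x1, x2, x3, x4 = x
    return int((x1 and (x2 or x4)) or ((not x1) and x3 and x4))


print(is_read_once(FBDD, ROOT))
print(all(evaluate(FBDD, ROOT, p) == formula(p) for p in product((0, 1), repeat=4)))
```

Sharing node `d` between both branches keeps the graph small while each path
still tests every variable at most once.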


Monotonic classifiers. Monotonic classifiers find a number of important
applications, and have been studied extensively in recent years [26, 48, 65, 70].
Let $\preccurlyeq$ denote a partial order on the set of classes $\mathcal{K}$.
For example, we assume $c_1 \preccurlyeq c_2 \preccurlyeq \ldots \preccurlyeq c_K$.
Furthermore, we assume that each domain $\mathbb{D}_i$ is ordered such that the
value taken by feature $i$ is between a lower bound $\lambda(i)$ and an upper
bound $\mu(i)$. Given $\mathbf{v}_1 = (v_{11}, \ldots, v_{1i}, \ldots, v_{1m})$
and $\mathbf{v}_2 = (v_{21}, \ldots, v_{2i}, \ldots, v_{2m})$, we say that
$\mathbf{v}_1 \le \mathbf{v}_2$ if
$\forall(i \in \mathcal{F}).(v_{1i} \le v_{2i})$. Finally, a classifier is
monotonic if whenever $\mathbf{v}_1 \le \mathbf{v}_2$, then
$\kappa(\mathbf{v}_1) \preccurlyeq \kappa(\mathbf{v}_2)$.

Running examples. As hinted above, throughout the paper, we will consider 
two fairly different families of classifiers, namely classifiers represented with d- 
DNNFs and monotonic classifiers. 

Example 1. The first example is the d-DNNF classifier $\mathcal{C}_1$ shown in
Fig. 1. It represents the boolean function
$(x_1 \land (x_2 \lor x_4)) \lor (\neg x_1 \land x_3 \land x_4)$. The instance
considered throughout the paper is $(\mathbf{v}_1, c_1) = ((0, 1, 0, 0), 0)$.

Fig. 1: Example of d-DNNF classifier
(a) Graphical representation of the d-DNNF, i.e. $\kappa_1$ [graphic not reproduced]
(b) Definition of $\mathcal{F}_1$, $\mathbb{D}_{1i}$, $\mathcal{K}_1$:
    $\mathcal{F}_1 = \{1, 2, 3, 4\}$; $\mathbb{D}_{1i} = \{0, 1\}$, $i = 1, \ldots, 4$; $\mathcal{K}_1 = \{0, 1\}$
(c) Alternative representation of $\kappa_1$:
    IF $x_1 = 1 \land x_2 = 1$ THEN 1
    ELSE IF $x_1 = 1 \land x_4 = 1$ THEN 1
    ELSE IF $x_3 = 1 \land x_4 = 1$ THEN 1
    ELSE 0


Fig. 2: Example of a monotonic classifier
(a) Definition of $\mathcal{F}_2$, $\mathbb{D}_{2i}$, $\mathcal{K}_2$:
    $\mathcal{F}_2 = \{1, 2, 3, 4\}$; $\mathbb{D}_{2i} = \{0, 1\}$, $i = 1, \ldots, 4$; $\mathcal{K}_2 = \{0, 1\}$
(b) Definition of $\kappa_2$:
    $\kappa_2(\mathbf{x}) = 1$ if $x_1 + x_2 + x_3 \ge 2$, and $0$ otherwise


Example 2. The second running example is the monotonic classifier
$\mathcal{C}_2$ shown in Fig. 2. The instance that is considered throughout the
paper is $(\mathbf{v}_2, c_2) = ((1, 1, 1, 1), 1)$.
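
As a sanity check of the monotonicity definition, the following sketch
exhaustively verifies that the classifier of Fig. 2 is monotonic. It assumes
the threshold reads $x_1 + x_2 + x_3 \ge 2$ (consistent with the AXp’s reported
later for this example); the helper names `leq` and `is_monotonic` are
hypothetical, and the exhaustive check is only viable for tiny feature spaces.

```python
from itertools import product


def kappa2(x):
    """Classifier of Fig. 2 (assumed threshold): class 1 iff x1 + x2 + x3 >= 2."""
    return 1 if x[0] + x[1] + x[2] >= 2 else 0


def leq(a, b):
    """Pointwise order on feature space: a <= b iff a_i <= b_i for every feature."""
    return all(ai <= bi for ai, bi in zip(a, b))


def is_monotonic(kappa, domains):
    """Exhaustively check that a <= b implies kappa(a) <= kappa(b)."""
    points = list(product(*domains))
    return all(kappa(a) <= kappa(b) for a in points for b in points if leq(a, b))


domains = [(0, 1)] * 4
print(is_monotonic(kappa2, domains))
```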


2.3 Formal Explainability 


Prime implicant (PI) explanations [63] represent a minimal set of literals
(relating a feature value $x_i$ and a constant $v_i \in \mathbb{D}_i$) that are
logically sufficient for the prediction. PI-explanations are related to
logic-based abduction, and so are also referred to as abductive explanations
(AXp’s) [54]. AXp’s offer guarantees of rigor that are not offered by other
alternative explanation approaches. More recently, AXp’s have been studied in
terms of their computational complexity [7,10]. There is a growing body of
recent work on formal explanations [3-5, 14, 15, 24, 27, 33, 51, 54, 67].
Formally, given $\mathbf{v} = (v_1, \ldots, v_m) \in \mathbb{F}$, with
$\kappa(\mathbf{v}) = c$, an AXp is any subset-minimal set
$\mathcal{X} \subseteq \mathcal{F}$ such that,

$\mathsf{WAXp}(\mathcal{X}) := \forall(\mathbf{x} \in \mathbb{F}).\,\left[\bigwedge_{i \in \mathcal{X}} (x_i = v_i)\right] \to (\kappa(\mathbf{x}) = c)$    (1)

If a set $\mathcal{X} \subseteq \mathcal{F}$ is not minimal but (1) holds, then
$\mathcal{X}$ is referred to as a weak AXp. Clearly, the predicate WAXp maps
$2^{\mathcal{F}}$ into $\{\bot, \top\}$ (or {false, true}). Given
$\mathbf{v} \in \mathbb{F}$, an AXp $\mathcal{X}$ represents an irreducible (or
minimal) subset of the features which, if assigned the values dictated by
$\mathbf{v}$, are sufficient for the prediction $c$, i.e. value changes to the
features not in $\mathcal{X}$ will not change the prediction. We can use the
definition of the predicate WAXp to formalize the definition of the predicate
AXp, also defined on subsets $\mathcal{X}$ of $\mathcal{F}$:

$\mathsf{AXp}(\mathcal{X}) := \mathsf{WAXp}(\mathcal{X}) \land \forall(\mathcal{X}' \subsetneq \mathcal{X}).\,\neg\mathsf{WAXp}(\mathcal{X}')$    (2)

The definition of WAXp($\mathcal{X}$) ensures that the predicate is monotone.
Indeed, if $\mathcal{X} \subseteq \mathcal{X}' \subseteq \mathcal{F}$, and if
$\mathcal{X}$ is a weak AXp, then $\mathcal{X}'$ is also a weak AXp, as the
fixing of more features will not change the prediction. Given the monotonicity
of predicate WAXp, the definition of predicate AXp can be simplified as follows,
with $\mathcal{X} \subseteq \mathcal{F}$:

$\mathsf{AXp}(\mathcal{X}) := \mathsf{WAXp}(\mathcal{X}) \land \forall(j \in \mathcal{X}).\,\neg\mathsf{WAXp}(\mathcal{X} \setminus \{j\})$    (3)

This simpler but equivalent definition of AXp has important practical
significance, in that only a linear number of subsets needs to be checked, as
opposed to exponentially many subsets in (2). As a result, the algorithms that
compute one AXp are based on (3) [54].
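
A minimal sketch of how definition (3) translates into a deletion-based
procedure is given below, using the function of Example 1. The weak-AXp test is
done here by brute-force enumeration of feature space, which is an assumption
made only to keep the example self-contained (the classifier families discussed
in this paper admit far more efficient tests); the names `is_weak_axp` and
`one_axp` are hypothetical, and features are 0-indexed in the code.

```python
from itertools import product


def kappa1(x):
    """Boolean function of Example 1: (x1 & (x2 | x4)) | (~x1 & x3 & x4)."""
    x1, x2, x3, x4 = x
    return int((x1 and (x2 or x4)) or ((not x1) and x3 and x4))


def is_weak_axp(kappa, v, S, domains):
    """WAXp(S): fixing the features in S to their values in v forces kappa(v)."""
    c = kappa(v)
    free = [i for i in range(len(v)) if i not in S]
    for vals in product(*(domains[i] for i in free)):
        x = list(v)
        for i, val in zip(free, vals):
            x[i] = val
        if kappa(x) != c:
            return False
    return True


def one_axp(kappa, v, domains, seed=None):
    """Shrink a weak AXp (by default, all features) using one WAXp check per feature."""
    S = set(range(len(v))) if seed is None else set(seed)
    for j in sorted(S):
        if is_weak_axp(kappa, v, S - {j}, domains):
            S.remove(j)  # feature j is not needed: drop it for good
    return S


DOMAINS = [(0, 1)] * 4
axp = one_axp(kappa1, (0, 1, 0, 0), DOMAINS)
print(sorted(i + 1 for i in axp))  # 1-based feature indices: [1, 4]
```

The order in which features are tested determines which AXp is returned; a
different traversal order can yield {1,3} instead.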
Example 3. From Example 1, and given the instance ((0,1,0,0),0), we can con- 
clude that the prediction will be 0 if features 1 and 3 take value 0, or if features 
1 and 4 take value 0. Hence, the AXp’s are {1,3} and {1,4}. It is also apparent 
that the assignment x2 = 1 bears no relevance on the fact that the prediction is 
0. 
Example 4. From Example 2, we can conclude that assigning value 1 to any two of
the features 1, 2 and 3 suffices for the prediction. Hence, given the instance
((1, 1, 1, 1), 1), the possible AXp’s are {1,2}, {1,3}, and {2,3}. Observe that
the definition of $\kappa_2$ does not depend on feature 4.
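
The AXp’s claimed in Examples 3 and 4 can be reproduced by exhaustive
enumeration, as sketched below. The sketch assumes boolean features and the
threshold $x_1 + x_2 + x_3 \ge 2$ for the monotonic classifier; `all_axps` and
`show` are hypothetical helper names, and the enumeration is exponential, so it
is only meant for such toy examples.

```python
from itertools import combinations, product


def kappa1(x):
    """d-DNNF classifier of Example 1."""
    x1, x2, x3, x4 = x
    return int((x1 and (x2 or x4)) or ((not x1) and x3 and x4))


def kappa2(x):
    """Monotonic classifier of Example 2 (assumed threshold >= 2)."""
    return int(x[0] + x[1] + x[2] >= 2)


def is_weak_axp(kappa, v, S):
    """WAXp(S) for boolean features, checked over every completion of v on S."""
    free = [i for i in range(len(v)) if i not in S]
    for vals in product((0, 1), repeat=len(free)):
        x = list(v)
        for i, val in zip(free, vals):
            x[i] = val
        if kappa(x) != kappa(v):
            return False
    return True


def all_axps(kappa, v):
    """Enumerate subsets by increasing size; a weak AXp is an AXp iff it
    contains no previously found (hence smaller) AXp."""
    axps = []
    for size in range(len(v) + 1):
        for S in map(set, combinations(range(len(v)), size)):
            if is_weak_axp(kappa, v, S) and not any(A <= S for A in axps):
                axps.append(S)
    return axps


def show(axps):
    """Convert 0-based sets into sorted lists of 1-based feature indices."""
    return sorted(sorted(i + 1 for i in A) for A in axps)


print(show(all_axps(kappa1, (0, 1, 0, 0))))  # [[1, 3], [1, 4]]
print(show(all_axps(kappa2, (1, 1, 1, 1))))  # [[1, 2], [1, 3], [2, 3]]
```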

Besides abductive explanations, another commonly studied type of explanation
is the contrastive or counterfactual explanation [8, 36, 39, 55]. As argued in
related work [36], the duality between abductive and contrastive explanations 
implies that for the purpose of the queries studied in this paper, it suffices to 
study solely abductive explanations. 


3 Feature Relevancy & Necessity: Theory 


This section investigates the complexity of feature relevancy and necessity. We 
are interested in membership results, which allow us to devise algorithms for 
the target problems. We are also interested in hardness results, which serve to 
confirm that the running time complexities of the proposed algorithms are within 
reason, given the problem’s complexity. 


3.1 Defining Necessity, Relevancy & Irrelevancy 


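Throughout this section, a classifier $\mathcal{C}$ is assumed, with features
$\mathcal{F}$, domains $\mathbb{D}_i$, $i \in \mathcal{F}$, classes
$\mathcal{K}$, a classification function $\kappa : \mathbb{F} \to \mathcal{K}$,
and a concrete instance $(\mathbf{v}, c)$, $\mathbf{v} \in \mathbb{F}$,
$c \in \mathcal{K}$.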


6 For the sake of brevity, we opt to only present sketches of some of the proofs. 
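
Definition 1 (Feature Necessity, Relevancy & Irrelevancy). Let $\mathbb{A}$
denote the set of all AXp’s for a classifier given a concrete instance, i.e.
$\mathbb{A} = \{\mathcal{X} \subseteq \mathcal{F} \mid \mathsf{AXp}(\mathcal{X})\}$,
and let $t \in \mathcal{F}$ be a target feature. Then, (i) $t$ is necessary if
$t \in \bigcap_{\mathcal{X} \in \mathbb{A}} \mathcal{X}$; (ii) $t$ is relevant if
$t \in \bigcup_{\mathcal{X} \in \mathbb{A}} \mathcal{X}$; and (iii) $t$ is
irrelevant if
$t \in \mathcal{F} \setminus \bigcup_{\mathcal{X} \in \mathbb{A}} \mathcal{X}$.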

Throughout the remainder of the paper, the problem of deciding feature 
necessity is represented by the acronym FNP, and the problem of deciding feature 
relevancy is represented by the acronym FRP. 

Example 5. As shown earlier, for the d-DNNF classifier of Fig. 1, and given the 
instance (v1,c1) = ((0,1,0,0),0), there exist two AXp’s, i.e. {1,3} and {1, 4}. 
Clearly, feature 1 is necessary, and features 1, 3 and 4 are relevant. In contrast, 
feature 2 is irrelevant. 

Example 6. For the monotonic classifier of Fig. 2, and given the instance (v2, c2) = 
((1,1,1,1),1), we have argued earlier that there exist three AXp’s, i.e. {1,2},
{1,3} and {2,3}, which allows us to conclude that features 1, 2 and 3 are relevant, 
but that feature 4 is irrelevant. In this case, there are no necessary features. 
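
Definition 1 amounts to an intersection and a union over the set of AXp’s. The
short sketch below applies it to the AXp’s listed in Examples 5 and 6; the
function name `classify_features` is hypothetical, and the AXp’s are supplied
as input rather than computed.

```python
def classify_features(features, axps):
    """Return (necessary, relevant, irrelevant) as in Definition 1.

    `axps` is assumed to be the non-empty collection of all AXp's of an instance.
    """
    axps = [set(a) for a in axps]
    necessary = set(features).intersection(*axps)   # in every AXp
    relevant = set().union(*axps)                   # in at least one AXp
    irrelevant = set(features) - relevant           # in no AXp
    return necessary, relevant, irrelevant


features = {1, 2, 3, 4}
print(classify_features(features, [{1, 3}, {1, 4}]))          # Example 5
print(classify_features(features, [{1, 2}, {1, 3}, {2, 3}]))  # Example 6
```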

The general complexity of necessity and (ir)relevancy has been studied in 
the context of logic-based abduction [25, 30,61]. Recent uses in explainability 
are briefly overviewed in Section 6. 


3.2 Feature Necessity 


Proposition 2. If deciding WAXp($\mathcal{X}$) is in complexity class
$\mathcal{C}$, then FNP is in the complexity class co-$\mathcal{C}$.

Given the known polynomial complexity of deciding whether a set is a weak 
AXp for several families of classifiers [54], we then have the following result: 


Corollary 3. For DTs, XpG’s⁷, NBCs, d-DNNF classifiers and monotonic
classifiers, FNP is in P.
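
One way to see why a single weak-AXp check suffices for necessity is the
following observation, which follows from the monotonicity of WAXp: a feature
$t$ is necessary iff $\mathcal{F} \setminus \{t\}$ is not a weak AXp. The sketch
below illustrates this on Example 1. The brute-force `is_weak_axp` is only a
stand-in for the polynomial-time tests available for the families in
Corollary 3, and `is_necessary` is a hypothetical name.

```python
from itertools import product


def kappa1(x):
    """d-DNNF classifier of Example 1."""
    x1, x2, x3, x4 = x
    return int((x1 and (x2 or x4)) or ((not x1) and x3 and x4))


def is_weak_axp(kappa, v, S):
    """Brute-force WAXp(S) for boolean features (stand-in for a dedicated test)."""
    free = [i for i in range(len(v)) if i not in S]
    for vals in product((0, 1), repeat=len(free)):
        x = list(v)
        for i, val in zip(free, vals):
            x[i] = val
        if kappa(x) != kappa(v):
            return False
    return True


def is_necessary(kappa, v, t):
    """t occurs in every AXp iff freeing t alone can change the prediction."""
    everything_but_t = set(range(len(v))) - {t}
    return not is_weak_axp(kappa, v, everything_but_t)


v1 = (0, 1, 0, 0)
print([t + 1 for t in range(4) if is_necessary(kappa1, v1, t)])  # [1]
```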


3.3 Feature Relevancy: Membership Results 


Proposition 4 (Feature Relevancy for DTs [36]). FRP for DTs is in P. 
Proposition 5. If deciding WAXp($\mathcal{X}$) is in P, then FRP is in NP.

The argument above can also be used for proving the following results. 
Corollary 6. For XpG’s, NBCs, d-DNNF classifiers and monotonic classifiers, 
FRP is in NP. 

Proposition 7. If deciding WAXp($\mathcal{X}$) is in NP, then FRP is in $\Sigma_2^P$.
Corollary 8. For DLs, DSs, RFs, BTs, and NNs, FRP is in $\Sigma_2^P$.

Additional results. The following result will prove useful in designing algo- 
rithms for FRP in practice. 


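Proposition 9. Let $\mathcal{X} \subseteq \mathcal{F}$, and let
$t \in \mathcal{X}$ denote some target feature such that
WAXp($\mathcal{X}$) holds and WAXp($\mathcal{X} \setminus \{t\}$) does not hold.
Then, for any AXp $\mathcal{Z} \subseteq \mathcal{X} \subseteq \mathcal{F}$, it
must be the case that $t \in \mathcal{Z}$.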
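
Proposition 9 suggests looking for a witness set: a weak AXp that stops being
one as soon as $t$ is freed. The sketch below searches for such a witness by
brute force on Example 1; this exhaustive search is an illustrative assumption
and not the method proposed in this paper, and `find_witness` is a hypothetical
name (features are 0-indexed in the code).

```python
from itertools import combinations, product


def kappa1(x):
    """d-DNNF classifier of Example 1."""
    x1, x2, x3, x4 = x
    return int((x1 and (x2 or x4)) or ((not x1) and x3 and x4))


def is_weak_axp(kappa, v, S):
    """Brute-force WAXp(S) for boolean features."""
    free = [i for i in range(len(v)) if i not in S]
    for vals in product((0, 1), repeat=len(free)):
        x = list(v)
        for i, val in zip(free, vals):
            x[i] = val
        if kappa(x) != kappa(v):
            return False
    return True


def find_witness(kappa, v, t):
    """Return a set S containing t with WAXp(S) and not WAXp(S minus {t}), or None."""
    others = [i for i in range(len(v)) if i != t]
    for size in range(len(others) + 1):
        for rest in combinations(others, size):
            S = set(rest) | {t}
            if is_weak_axp(kappa, v, S) and not is_weak_axp(kappa, v, S - {t}):
                return S
    return None


v1 = (0, 1, 0, 0)
print(find_witness(kappa1, v1, 2))  # feature 3 (index 2): witness {0, 2}
print(find_witness(kappa1, v1, 1))  # feature 2 (index 1): None, i.e. irrelevant
```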


7 Explanation graphs (XpG’s) have been proposed to enable the computation of
explanations for decision graphs, and (multi-valued) decision diagrams [36].


3.4 Feature Relevancy: Hardness Results


Proposition 10 (Relevancy for DNF Classifiers [36]). Feature relevancy for a
DNF classifier is $\Sigma_2^P$-hard.

Proposition 11. Feature relevancy for monotonic classifiers is NP-hard.

Proof. We say that a CNF is trivially satisfiable if some literal occurs in all
clauses. Clearly, SAT restricted to nontrivial CNFs is still NP-complete. Let
$\Phi$ be a not trivially satisfiable CNF on variables $x_1, \ldots, x_k$. Let
$N = 2k$. Let $\tilde{\Phi}$ be identical to $\Phi$ except that each occurrence
of a negative literal $\neg x_i$ ($1 \le i \le k$) is replaced by $x_{i+k}$.
Thus $\tilde{\Phi}$ is a CNF on $N$ variables each of which occurs only
positively. Define the boolean classifier $\kappa$ (on $N+1$ features) by
$\kappa(x_0, x_1, \ldots, x_N) = 1$ iff $x_i = x_{i+k} = 1$ for some
$i \in \{1, \ldots, k\}$ or $x_0 \land \tilde{\Phi}(x_1, \ldots, x_N) = 1$. To
show that $\kappa$ is monotonic we need to show that
$\mathbf{a} \le \mathbf{b} \Rightarrow \kappa(\mathbf{a}) \le \kappa(\mathbf{b})$.
This follows by examining the two cases in which $\kappa(\mathbf{a}) = 1$: if
$a_i = a_{i+k} = 1$ and $\mathbf{a} \le \mathbf{b}$, then $b_i = b_{i+k} = 1$,
whereas, if $a_0 \land \tilde{\Phi}(a_1, \ldots, a_N) = 1$ and
$\mathbf{a} \le \mathbf{b}$, then $b_0 \land \tilde{\Phi}(b_1, \ldots, b_N) = 1$
(by positivity of $\tilde{\Phi}$), so in both cases
$\kappa(\mathbf{b}) = 1 \ge \kappa(\mathbf{a})$.

Clearly $\kappa(\mathbf{1}_{N+1}) = 1$. There are $k$ obvious AXp’s of this
prediction, namely $\{i, i+k\}$ ($1 \le i \le k$). These are minimal by the
assumption that $\Phi$ is not trivially satisfiable. This means that no other
AXp contains both $i$ and $i+k$ for any $i \in \{1, \ldots, k\}$. Suppose that
$\Phi(\mathbf{u}) = 1$. Let $\mathcal{X}_{\mathbf{u}}$ be
$\{0\} \cup \{i \mid 1 \le i \le k \land u_i = 1\} \cup \{i+k \mid 1 \le i \le k \land u_i = 0\}$.
Then $\mathcal{X}_{\mathbf{u}}$ is a weak AXp of the prediction
$\kappa(\mathbf{1}) = 1$. Furthermore $\mathcal{X}_{\mathbf{u}}$ does not
contain any of the AXp’s $\{i, i+k\}$. Therefore some subset of
$\mathcal{X}_{\mathbf{u}}$ is an AXp and clearly this subset must contain
feature 0. Thus if $\Phi$ is satisfiable, then there is an AXp which contains 0.

We now show that the converse also holds. If $\mathcal{X}$ is an AXp of
$\kappa(\mathbf{1}_{N+1}) = 1$ containing 0, then it cannot also contain any of
the pairs $i, i+k$ ($1 \le i \le k$), otherwise we could delete 0 and still have
an AXp. We will show that this implies that we can build a satisfying
assignment $\mathbf{u}$ for $\Phi$. Consider first
$\mathbf{v} = (v_0, \ldots, v_N)$ defined by $v_i = 1$ if $i \in \mathcal{X}$
($0 \le i \le N$) and $v_{i+k} = 1$ if neither $i$ nor $i+k$ belongs to
$\mathcal{X}$ ($1 \le i \le k$), and $v_i = 0$ otherwise ($1 \le i \le N$). Then
$\kappa(\mathbf{v}) = 1$ by definition of an AXp, since $\mathbf{v}$ agrees with
the vector $\mathbf{1}$ on all features in $\mathcal{X}$. We can also note that
$v_0 = 1$ since $0 \in \mathcal{X}$. Since $\mathcal{X}$ does not contain both
$i$ and $i+k$ ($1 \le i \le k$), it follows that $v_i \ne v_{i+k}$. Now let
$u_i = 1$ iff $i \in \mathcal{X}$ ($1 \le i \le k$). It is easy to verify that
$\Phi(\mathbf{u}) = \tilde{\Phi}(\mathbf{v}) = \kappa(\mathbf{v}) = 1$.

Thus, determining whether $\kappa(\mathbf{1}_{N+1}) = 1$ has an AXp containing
the feature 0 is equivalent to testing the satisfiability of $\Phi$. It follows
that FRP is NP-hard for monotonic classifiers by this polynomial reduction from
SAT.
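
The reduction in the proof above can be exercised on small formulas. The sketch
below builds $\kappa$ from a CNF and decides relevancy of feature 0 by brute
force; the signed-integer clause format (a literal $+i$ or $-i$ for variable
$i$), the function names, and the two example formulas are assumptions chosen
for illustration, and both formulas are not trivially satisfiable, as the proof
requires.

```python
from itertools import product


def build_kappa(cnf, k):
    """Classifier of the reduction: feature 0 plus 2k features x_1..x_2k.
    A negative literal -i is replaced by the positive variable x_{i+k}."""
    pos_cnf = [[lit if lit > 0 else -lit + k for lit in clause] for clause in cnf]

    def kappa(x):
        if any(x[i] and x[i + k] for i in range(1, k + 1)):
            return 1
        return int(x[0] and all(any(x[j] for j in clause) for clause in pos_cnf))

    return kappa


def is_weak_axp(kappa, v, S):
    """Brute-force WAXp(S) for boolean features."""
    free = [i for i in range(len(v)) if i not in S]
    for vals in product((0, 1), repeat=len(free)):
        x = list(v)
        for i, val in zip(free, vals):
            x[i] = val
        if kappa(x) != kappa(v):
            return False
    return True


def is_relevant(kappa, v, t):
    """t is relevant iff some weak AXp S containing t is no longer weak without t."""
    others = [i for i in range(len(v)) if i != t]
    for bits in product((0, 1), repeat=len(others)):
        S = {t} | {i for i, b in zip(others, bits) if b}
        if is_weak_axp(kappa, v, S) and not is_weak_axp(kappa, v, S - {t}):
            return True
    return False


sat_cnf = [[1, 2], [-1, -2]]                       # satisfiable (x1 != x2)
unsat_cnf = [[1, 2], [-1, 2], [1, -2], [-1, -2]]   # unsatisfiable
ones = (1,) * 5                                    # the instance 1_{N+1} with k = 2

print(is_relevant(build_kappa(sat_cnf, 2), ones, 0))    # True
print(is_relevant(build_kappa(unsat_cnf, 2), ones, 0))  # False
```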

Proposition 12. Relevancy for FBDD classifiers is NP-hard.

Proof. Let $\psi$ be a CNF formula defined on a variable set
$X = \{x_1, \ldots, x_m\}$ and with clauses $\{\omega_1, \ldots, \omega_n\}$. We
aim to construct an FBDD classifier $\mathcal{G}$ (representing a
classification function $\kappa$) based on $\psi$ and a target variable in
polynomial time, such that: $\psi$ is SAT iff for $\kappa$ there is an AXp
containing this target variable.

For any literal $l_j \in \omega_i$, replace $l_j$ with $l_j^i$. Let
$\psi' = \{\omega'_1, \ldots, \omega'_n\}$ denote the resulting CNF formula
defined on the new variables
$\{x_1^1, \ldots, x_1^n, \ldots, x_m^1, \ldots, x_m^n\}$. For each original
variable $x_j$, let $I_j^+$ and $I_j^-$ denote the indices of clauses containing
literal $x_j$ and $\neg x_j$, respectively. So if $i \in I_j^+$, then
$x_j^i \in \omega'_i$, and if $i \in I_j^-$, then $\neg x_j^i \in \omega'_i$. To
build an FBDD $D$ from $\psi'$: 1) build an FBDD $D_i$ for each $\omega'_i$; 2)
replace the terminal node 1 of $D_i$ with the root node of $D_{i+1}$. $D$ is
read-once because each variable $x_j^i$ occurs only once in $\psi'$. Satisfying
a literal $x_j^i \in \omega'_i$ means $x_j = 1$, while satisfying a literal
$\neg x_j^k \in \omega'_k$ means $x_j = 0$. If both $x_j^i$ and $\neg x_j^k$ are
satisfied, then it means we pick inconsistent values for the variable $x_j$,
which is unacceptable. Let us define $\phi$ to capture inconsistent values for
any variable $x_j$:

$\phi := \bigvee_{1 \le j \le m} \left( \left( \bigvee_{i \in I_j^+} x_j^i \right) \land \left( \bigvee_{k \in I_j^-} \neg x_j^k \right) \right)$    (4)

If $I_j^+ = \emptyset$, then let $(\bigvee_{i \in I_j^+} x_j^i) = 0$. If
$I_j^- = \emptyset$, then let $(\bigvee_{k \in I_j^-} \neg x_j^k) = 0$. Any true
point of $\phi$ means we pick inconsistent values for some variable $x_j$, so it
represents an unacceptable point of $\psi$. To avoid such inconsistency, one
needs to at least falsify either $\bigvee_{i \in I_j^+} x_j^i$ or
$\bigvee_{k \in I_j^-} \neg x_j^k$ for each variable $x_j$. To build an FBDD $G$
from $\phi$: 1) build FBDDs $G_j^+$ and $G_j^-$ for
$\bigvee_{i \in I_j^+} x_j^i$ and $\bigvee_{k \in I_j^-} \neg x_j^k$,
respectively; 2) replace the terminal node 1 of $G_j^+$ with the root node of
$G_j^-$, and let $G_j$ denote the resulting FBDD; 3) replace the terminal 0 of
$G_j$ with the root node of $G_{j+1}$. $G$ is read-once because each variable
$x_j^i$ occurs only once in $\phi$.

Create a root node labeled $x_0^0$, and link its 1-edge to the root of $D$ and
its 0-edge to the root of $G$. The resulting graph $\mathcal{G}$ is an FBDD
representing $\kappa := (x_0^0 \land \psi') \lor (\neg x_0^0 \land \phi)$;
$\kappa$ is a boolean classifier defined on $\{x_0^0, x_1^1, \ldots, x_m^n\}$
and $x_0^0$ is the target variable. The number of nodes of $\mathcal{G}$ is
$O(n \times m)$. Let $\mathcal{T} = \{(0,0), (1,1), \ldots, (n,m)\}$ denote the
set of variable indices, i.e. for variable $x_j^i$, $(i,j) \in \mathcal{T}$.

Pick an instance $\mathbf{v} = \{v_0^0, \ldots, v_j^i, \ldots\}$ satisfying
every literal of $\psi'$ (i.e. $v_j^i = 1$ and $v_j^k = 0$ for
$x_j^i, \neg x_j^k \in \psi'$) and such that $v_0^0 = 1$; then
$\psi'(\mathbf{v}) = 1$, and so $\kappa(\mathbf{v}) = 1$. Suppose
$\mathcal{X} \subseteq \mathcal{T}$ is an AXp of $\mathbf{v}$: 1) If
$\{(i,j), (k,j)\} \subseteq \mathcal{X}$ for some variable $x_j$, where
$i \in I_j^+$ and $k \in I_j^-$, then for any point $\mathbf{u}$ of $\kappa$ such
that $u_j^i = v_j^i$ for any $(i,j) \in \mathcal{X}$, we have
$\kappa(\mathbf{u}) = 1$ and $\phi(\mathbf{u}) = 1$. Moreover, if $\mathbf{u}$
sets $u_0^0 = 1$, then $\kappa(\mathbf{u}) = 1$ implies $\psi'(\mathbf{u}) = 1$,
else if $\mathbf{u}$ sets $u_0^0 = 0$, then $\kappa(\mathbf{u}) = 1$ because of
$\phi(\mathbf{u}) = 1$. Hence $\kappa(\mathbf{u}) = 1$ regardless of the value
of $u_0^0$, so $(0,0) \notin \mathcal{X}$. 2) If
$\{(i,j), (k,j)\} \not\subseteq \mathcal{X}$ for any variable $x_j$, where
$i \in I_j^+$ and $k \in I_j^-$, then for some point $\mathbf{u}$ of $\kappa$
such that $u_j^i = v_j^i$ for any $(i,j) \in \mathcal{X}$, we have
$\phi(\mathbf{u}) \ne 1$; in this case $\kappa(\mathbf{u}) = 1$ implies
$\psi'(\mathbf{u}) = 1$. Besides, any such $\mathbf{u}$ must set $u_0^0 = 1$, so
$(0,0) \in \mathcal{X}$.

If case 2) occurs, then $\psi$ is satisfiable (a satisfying assignment is
$x_j = 1$ iff $\exists i \in I_j^+$ s.t. $(i,j) \in \mathcal{X}$). If case 2)
never occurs, then $\psi$ is unsatisfiable. It follows that FRP is NP-hard for
FBDD classifiers by this polynomial reduction from SAT.


Corollary 13. Relevancy for d-DNNF classifiers is NP-hard. 


4 Feature Relevancy: Example Algorithms 


This section details two methods for FRP. One method decides feature relevancy
for d-DNNF classifiers, whereas the other method decides feature relevancy for
arbitrary monotonic classifiers. Based on Proposition 2 and Corollary 3,
existing algorithms for computing one AXp [35, 36, 52, 53] can be used to decide
feature necessity. Hence, there is no need for devising new algorithms.
Additionally, the weak AXp returned by the proposed methods (if one exists) can
be fed (as a seed) into the algorithms for computing one AXp [35,53] to extract
one AXp in polynomial time.


Table 1: Encoding for deciding whether there is a weak AXp including feature t.

Conditions                                                  Constraints                                                      Fml #
Leaf(j), Feat(j,i), Sat(Lit(j), $v_i$)                      $n_j^k \leftrightarrow 1$                                        (1.1)
Leaf(j), Feat(j,i), $\neg$Sat(Lit(j), $v_i$), $i = k$       $n_j^k \leftrightarrow 1$                                        (1.2)
Leaf(j), Feat(j,i), $\neg$Sat(Lit(j), $v_i$), $i \ne k$     $n_j^k \leftrightarrow \neg s_i$                                 (1.3)
NonLeaf(j), Oper(j) = $\lor$                                $n_j^k \leftrightarrow \bigvee_{l \in children(j)} n_l^k$       (1.4)
NonLeaf(j), Oper(j) = $\land$                               $n_j^k \leftrightarrow \bigwedge_{l \in children(j)} n_l^k$     (1.5)
$\kappa(\mathbf{v}) = 0$                                    $\neg n_1^0$                                                     (1.6)
$\kappa(\mathbf{v}) = 0$                                    $s_i \to n_1^i$                                                  (1.7)
                                                            $s_t$                                                            (1.8)


4.1 Relevancy for d-DNNF Classifiers 


This section details a propositional encoding that decides feature relevancy for
d-DNNFs. The encoding follows the approach described in the proof of
Proposition 9, and comprises two copies ($\mathcal{C}^0$ and $\mathcal{C}^t$) of
the same d-DNNF classifier $\mathcal{C}$: $\mathcal{C}^0$ encodes
WAXp($\mathcal{X}$) (i.e. the prediction of $\kappa$ remains unchanged), and
$\mathcal{C}^t$ encodes $\neg$WAXp($\mathcal{X} \setminus \{t\}$) (i.e. the
prediction of $\kappa$ changes). The encoding is polynomial in the size of the
classifier’s representation.

The encoding is applicable to the case $\kappa(\mathbf{x}) = 0$. The case
$\kappa(\mathbf{x}) = 1$ can be transformed to $\neg\kappa(\mathbf{x}) = 0$, so
we assume both d-DNNF $\mathcal{C}$ and its negation $\neg\mathcal{C}$ are
given. To present the constraints included in this encoding, we need to
introduce some auxiliary boolean variables and predicates.

1. $s_i$, $1 \le i \le m$. $s_i$ is a selector such that $s_i = 1$ iff feature
   $i$ is included in a weak AXp candidate $\mathcal{X}$.
2. $n_j^k$, $1 \le j \le |\mathcal{C}|$ and $0 \le k \le m$. $n_j^k$ is the
   indicator of a node $j$ of d-DNNF $\mathcal{C}$ for replica $k$. The
   indicator for the root node of the $k$-th replica is $n_1^k$. Moreover, the
   semantics of $n_j^k$ is: $n_j^k = 1$ iff the sub-d-DNNF rooted at node $j$ in
   the $k$-th replica is consistent.
3. Leaf(j) = 1 if the node $j$ is a leaf node.
4. NonLeaf(j) = 1 if the node $j$ is a non-leaf node.
5. Feat(j,i) = 1 if the leaf node $j$ is labeled with feature $i$.
6. Sat(Lit(j), $v_i$) = 1 if for leaf node $j$, the literal on feature $i$ is
   satisfied by $v_i$.

The encoding is summarized in Table 1. As literals are d-DNNF leaves, the values
of the selector variables only affect the values of the indicator variables of
leaf nodes. Constraint (1.1) states that for any leaf node $j$ whose literal is
consistent with the given instance, its indicator $n_j^k$ is always consistent
regardless of the value of $s_i$. On the contrary, constraint (1.3) states that
for any leaf node $j$ whose literal is inconsistent with the given instance, its
indicator $n_j^k$ is consistent iff feature $i$ is not picked, in other words,
feature $i$ can take any value. Because replica $k$ ($k > 0$) is used to check
the necessity of including feature $k$ in $\mathcal{X}$, we assume the value of
the local copy of selector $s_k$ is 0 in replica $k$. In this case, as defined
in constraint (1.2), even though leaf node $j$ labeled with feature $k$ has a
literal that is inconsistent with the given instance, its indicator $n_j^k$ is
consistent. Constraint (1.4) defines the indicator for an arbitrary $\lor$ node
$j$. Constraint (1.5) defines the indicator for an arbitrary $\land$ node $j$.
Together, these constraints declare how the consistency is propagated through
the entire d-DNNF. Constraint (1.6) states that the prediction of the d-DNNF
classifier $\mathcal{C}$ remains 0 since the selected features form a weak AXp.
Constraint (1.7) states that if feature $i$ is selected, then removing it will
change the prediction of $\mathcal{C}$. Finally, constraint (1.8) indicates that
feature $t$ must be included in $\mathcal{X}$.

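Example 7. Given the d-DNNF classifier of Fig. 1 and the instance
$(\mathbf{v}_1, c_1) = ((0,1,0,0), 0)$, suppose that the target feature is 3. We
have selectors $s = \{s_1, s_2, s_3, s_4\}$, and the encoding is as follows:
1. $(n_1^0 \leftrightarrow n_2^0 \lor n_3^0) \land (n_2^0 \leftrightarrow n_4^0 \land n_5^0) \land (n_3^0 \leftrightarrow n_6^0 \land n_7^0) \land (n_5^0 \leftrightarrow n_8^0 \lor n_9^0) \land$
   $(n_7^0 \leftrightarrow n_{10}^0 \land n_{11}^0) \land (n_9^0 \leftrightarrow n_{12}^0 \land n_{13}^0) \land (n_4^0 \leftrightarrow \neg s_1) \land (n_6^0 \leftrightarrow 1) \land (n_8^0 \leftrightarrow 1) \land$
   $(n_{10}^0 \leftrightarrow \neg s_3) \land (n_{11}^0 \leftrightarrow \neg s_4) \land (n_{12}^0 \leftrightarrow \neg s_2) \land (n_{13}^0 \leftrightarrow \neg s_4) \land (\neg n_1^0)$
2. $(n_1^3 \leftrightarrow n_2^3 \lor n_3^3) \land (n_2^3 \leftrightarrow n_4^3 \land n_5^3) \land (n_3^3 \leftrightarrow n_6^3 \land n_7^3) \land (n_5^3 \leftrightarrow n_8^3 \lor n_9^3) \land$
   $(n_7^3 \leftrightarrow n_{10}^3 \land n_{11}^3) \land (n_9^3 \leftrightarrow n_{12}^3 \land n_{13}^3) \land (n_4^3 \leftrightarrow \neg s_1) \land (n_6^3 \leftrightarrow 1) \land (n_8^3 \leftrightarrow 1) \land$
   $(n_{10}^3 \leftrightarrow 1) \land (n_{11}^3 \leftrightarrow \neg s_4) \land (n_{12}^3 \leftrightarrow \neg s_2) \land (n_{13}^3 \leftrightarrow \neg s_4) \land (s_3 \to n_1^3)$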

Given the AXp’s listed in Example 3, by solving these formulas we will either 
obtain {1,3} or {1,4} as the AXp. 


4.2 Relevancy for Monotonic Classifiers 


This section describes an algorithm for FRP in the case of monotonic classifiers. 
No assumption is made regarding the actual implementation of the monotonic 
classifier. 

Abstraction refinement for relevancy. The algorithm proposed in this sec- 
tion iteratively refines an over-approximation (or abstraction) of all the subsets 
S of F such that: i) S is a weak AXp, and ii) any AXp included in S also includes 
the target feature t. Formally, the set of subsets of F that we are interested in 
is defined as follows: 


$\mathbb{H} = \{ S \subseteq \mathcal{F} \mid \mathsf{WAXp}(S) \land \forall(\mathcal{X} \subseteq S).\,[\mathsf{AXp}(\mathcal{X}) \to (t \in \mathcal{X})] \}$    (5)

The proposed algorithm iteratively refines the over-approximation of set
$\mathbb{H}$ until one can decide with certainty whether $t$ is included in some
AXp. The refinement step involves exploiting counterexamples as these are
identified. (The approach is referred to as abstraction refinement FRP, since
the use of abstraction refinement can be related with earlier work (with the
same name) in model checking [20].) In practice, it will in general be
impractical to manipulate such an over-approximation of set $\mathbb{H}$
explicitly. As a result, we use a propositional formula (in fact a CNF formula)
$\mathcal{H}$, such that the models of $\mathcal{H}$ encode the subsets of
features about which we have yet to decide whether each of those subsets only
contains AXp’s that include $t$. (Formula $\mathcal{H}$ is defined on a set of
Boolean variables $\{s_1, \ldots, s_m\}$, where each $s_i$ is associated with
feature $i$, and assigning $s_i = 1$ denotes that feature $i$ is included in a
given set, as described below.) The algorithm then iteratively refines the
over-approximation by filtering out sets of sets that have been shown not to be
included in $\mathbb{H}$, i.e. the so-called counterexamples.


Algorithm 1 summarizes the proposed approach⁸. Also, Algorithms 2 and 3
provide supporting functions. (For simplicity, the function calls of Algorithms 2 
and 3 show the arguments, but not the parameterizations.) Algorithm 1 itera- 
tively uses an NP oracle (in fact a SAT solver) to pick (or guess) a subset P of 
F, such that any previously picked set is not repeated. Since we are interested 
in feature t, we enforce that the picked set must include t. (This step is shown in 
lines 4 to 7.) Now, the features not in P are deemed universal, and so we need to 
account for the range of possible values that these universal features can take. 
For that, we update lower and upper bounds on the predicted classes. For the 
features in P we must use the values dictated by v. (This is shown in lines 8 
and 9, and it is sound to do because we have monotonicity of prediction.) If the 
lower and upper bounds differ, then the picked set is not even a weak AXp, and 
so we can safely remove it from further consideration. This is achieved by enforc- 
ing that at least one of the non-picked elements is picked in the future. (As can 
be observed H is updated with a positive clause that captures this constraint, 
as shown in line 11.) If the lower and upper bounds do not differ (i.e. we picked
a weak AXp), and if allowing t to take any value causes the bounds to differ,
then we know that any AXp in P must include t, and so the algorithm reports P 
as a weak AXp that is guaranteed to be included in H. (This is shown in line 14.) 
It should be noted that P is not necessarily an AXp. However, by Proposition 9, 
P is guaranteed to be a weak AXp such that any of the AXp’s contained in P 
must include feature t. From [53], we know that we can extract an AXp from a 
weak AXp in polynomial time, and in this case we are guaranteed to always pick 
one that includes t. Finally, the last case is when allowing t to take any value 
does not cause the lower and upper bounds to change. This means we picked a 
set P that is a weak AXp, but not all AXp’s in P include the target feature t 
(again due to Proposition 9). As a result, we must prevent the same weak AXp 
from being re-picked. This is achieved by requiring that at least one of the picked 
features not be picked again in the feature set. (This is shown in line 16. As can 
be observed, H is updated with a negative clause that captures this constraint.) 


As can be concluded from Algorithm 1 and from the discussion above,
Proposition 9 is essential to enable us to use at most two classification
queries per iteration of the algorithm. If we were to use Proposition 5 instead,
then the number of classification queries would be significantly larger.


8 Arguments can either represent actual arguments or some parameterization; these
are separated by a semi-colon.


Algorithm 1 Deciding feature relevancy for a monotonic classifier

Input: Instance $\mathbf{v}$, Target feature $t$; Feature Set $\mathcal{F}$, Monotonic Classifier $\kappa$

 1: function DecideRelevant($\mathbf{v}$, $t$; $\mathcal{F}$, $\kappa$)
 2:   $\mathcal{H} \leftarrow \emptyset$                          ▷ $\mathcal{H}$ overapproximates $\mathbb{H}$
 3:   repeat
 4:     (outc, $\mathbf{s}$) $\leftarrow$ SAT($\mathcal{H}$, $s_t$)          ▷ Pick candidate weak AXp containing $t$
 5:     if outc = true then
 6:       $P \leftarrow \{i \in \mathcal{F} \mid s_i = 1\}$           ▷ $P$ is the candidate weak AXp, and $t \in P$
 7:       $D \leftarrow \{i \in \mathcal{F} \mid s_i = 0\}$           ▷ $D$ contains the features not included in $P$
 8:       $\mathbf{v}_L \leftarrow (v_{L_1}, \ldots, v_{L_m})$, s.t. $v_{L_i} \leftarrow$ ITE($s_i$, $v_i$, $\lambda(i)$)   ▷ $\mathbf{v}_L$: LB
 9:       $\mathbf{v}_U \leftarrow (v_{U_1}, \ldots, v_{U_m})$, s.t. $v_{U_i} \leftarrow$ ITE($s_i$, $v_i$, $\mu(i)$)       ▷ $\mathbf{v}_U$: UB
10:       if $\kappa(\mathbf{v}_L) \ne \kappa(\mathbf{v}_U)$ then         ▷ More than one value possible?
11:         $\mathcal{H} \leftarrow \mathcal{H} \cup$ newPosCl($D$, $t$)     ▷ $P$ is not a weak AXp; block set
12:       else                                                     ▷ $P$ is a weak AXp
13:         if $\kappa(\mathbf{v}_L[v_{L_t} \leftarrow \lambda(t)]) \ne \kappa(\mathbf{v}_U[v_{U_t} \leftarrow \mu(t)])$ then   ▷ $t$ needed?
14:           reportWeakAXp($P$)                                   ▷ $t$ is included in any AXp $\mathcal{X} \subseteq P$
15:           return true
16:         $\mathcal{H} \leftarrow \mathcal{H} \cup$ newNegCl($P$, $t$)     ▷ $t$ unneeded; block set
17:   until outc = false
18:   return false                ▷ If $\mathcal{H}$ becomes inconsistent, then no AXp contains $t$


Table 2: Example algorithm execution for t = 4

s           P          D          $\kappa(\mathbf{v}_L)$   $\kappa(\mathbf{v}_U)$   Decision              New clause                       Line
(0,0,0,1)   {4}        {1,2,3}    0                        1                        New pos clause        $(s_1 \lor s_2 \lor s_3)$        11
(1,0,0,1)   {1,4}      {2,3}      0                        1                        New pos clause        $(s_2 \lor s_3)$                 11
(1,1,0,1)   {1,2,4}    {3}        1                        1                        New neg clause        $(\neg s_1 \lor \neg s_2)$       16
(1,0,1,1)   {1,3,4}    {2}        1                        1                        New neg clause        $(\neg s_1 \lor \neg s_3)$       16
(0,1,1,1)   {2,3,4}    {1}        1                        1                        New pos clause        $(s_1)$                          11
                                                                                    $\mathcal{H}$ inconsistent   -                         17


Example 8. We consider the monotonic classifier of Fig. 2, with instance (v, c) = 
((1,1,1,1),1). Table 2 summarizes a possible execution of the algorithm when 
t = 4. Similarly, Table 3 summarizes a possible execution of the algorithm when 
t = 1. (As with the current implementation, and for both examples, the cre- 
ation of clauses uses no optimizations.) In general, different executions will be 
determined by the models returned by the SAT solver. 
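
The loop of Algorithm 1 can be sketched in a few lines of Python. This is a
simplified illustration, not the prototype described in this paper: the SAT
oracle is replaced by brute-force enumeration of selector assignments
(`find_model`), clauses are created without the optimizations of Algorithms 2
and 3, and the names `decide_relevant`, `lam` and `mu` (the per-feature lower
and upper bounds) are hypothetical. Features are 0-indexed in the code, and the
classifier of Fig. 2 is assumed to use the threshold $\ge 2$.

```python
from itertools import product


def kappa2(x):
    """Monotonic classifier of Fig. 2 (assumed threshold): 1 iff x1 + x2 + x3 >= 2."""
    return 1 if x[0] + x[1] + x[2] >= 2 else 0


def find_model(clauses, m, t):
    """Stand-in for the SAT oracle: first s in {0,1}^m with s[t] = 1 satisfying
    all clauses. A clause is a list of (feature index, required value) literals."""
    for s in product((0, 1), repeat=m):
        if s[t] == 1 and all(any(s[i] == val for i, val in cl) for cl in clauses):
            return s
    return None


def decide_relevant(kappa, v, t, lam, mu):
    """Return a weak AXp witnessing the relevancy of t, or None if t is irrelevant."""
    m = len(v)
    clauses = []
    while True:
        s = find_model(clauses, m, t)
        if s is None:
            return None                                # formula inconsistent
        P = [i for i in range(m) if s[i]]              # picked features
        D = [i for i in range(m) if not s[i]]          # universal features
        vL = [v[i] if s[i] else lam[i] for i in range(m)]
        vU = [v[i] if s[i] else mu[i] for i in range(m)]
        if kappa(vL) != kappa(vU):
            clauses.append([(i, 1) for i in D])        # P not weak: pick more
        else:
            vL[t], vU[t] = lam[t], mu[t]               # free the target feature
            if kappa(vL) != kappa(vU):
                return set(P)                          # t is needed in P
            clauses.append([(i, 0) for i in P if i != t])  # t unneeded: block P


v, lam, mu = (1, 1, 1, 1), (0, 0, 0, 0), (1, 1, 1, 1)
print(decide_relevant(kappa2, v, 0, lam, mu))  # feature 1: a witness set
print(decide_relevant(kappa2, v, 3, lam, mu))  # feature 4: None
```

Because the stand-in oracle enumerates assignments in a fixed order, the
sequence of picked sets differs from the executions shown in the tables, but
the final decisions agree: feature 1 is relevant and feature 4 is not.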


With respect to the clauses that are added to $\mathcal{H}$ at each step, as
shown in Algorithms 2 and 3, one can envision optimizations (shown in lines 2 to
7 of both algorithms) that heuristically aim at removing features from the given
sets, and so produce shorter (and so logically stronger) clauses. The insight is
that any feature, which can be deemed irrelevant for the condition used for
constructing the clause, can be safely removed from the set. (In practice, our
experiments show that the time running the classifier is far larger than the
time spent using the NP oracle to guess sets. Thus, we opted to use the simplest
approach for constructing the clauses, and so reduce the number of
classification queries.)
Given the above discussion, we can conclude that the proposed algorithm is
sound, complete and terminating for deciding feature relevancy for monotonic
classifiers. (The proof is straightforward, and it is omitted for the sake of
brevity.)


Algorithm 2 Create new pos. clause

Input: Set $D$, $t$; $\kappa$, $\mathbf{v}_L$, $\mathbf{v}_U$

1: function newPosCl($D$, $t$; $\kappa$, $\mathbf{v}_L$, $\mathbf{v}_U$)
2:   for all $i \in D$ do
3:     $(v_{L_i}, v_{U_i}) \leftarrow (v_i, v_i)$
4:     if $\kappa(\mathbf{v}_L) \ne \kappa(\mathbf{v}_U)$ then
5:       $D \leftarrow D \setminus \{i\}$
6:     else
7:       $(v_{L_i}, v_{U_i}) \leftarrow (\lambda(i), \mu(i))$
8:   $\omega \leftarrow (\bigvee_{i \in D} s_i)$
9:   return $\omega$


Algorithm 3 Create new neg. clause

Input: Set $P$, $t$; $\kappa$, $\mathbf{v}_L$, $\mathbf{v}_U$

1: function newNegCl($P$, $t$; $\kappa$, $\mathbf{v}_L$, $\mathbf{v}_U$)
2:   for all $i \in P \setminus \{t\}$ do
3:     $(v_{L_i}, v_{U_i}) \leftarrow (\lambda(i), \mu(i))$
4:     if $\kappa(\mathbf{v}_L) = \kappa(\mathbf{v}_U)$ then
5:       $P \leftarrow P \setminus \{i\}$
6:     else
7:       $(v_{L_i}, v_{U_i}) \leftarrow (v_i, v_i)$
8:   $\omega \leftarrow (\bigvee_{i \in P \setminus \{t\}} \neg s_i)$
9:   return $\omega$


Table 3: Example algorithm execution for t = 1

s           P        D          $\kappa(\mathbf{v}_L)$   $\kappa(\mathbf{v}_U)$   Decision            New clause                   Line
(1,0,0,0)   {1}      {2,3,4}    0                        1                        New pos clause      $(s_2 \lor s_3 \lor s_4)$    11
(1,1,0,0)   {1,2}    {3,4}      1                        1                        Weak AXp: {1,2}     -                            14


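Proposition 14. For a monotonic classifier $\mathcal{C}$, defined on set of
features $\mathcal{F}$, with $\kappa$ mapping $\mathbb{F}$ to $\mathcal{K}$, and
an instance $(\mathbf{v}, c)$, $\mathbf{v} \in \mathbb{F}$, $c \in \mathcal{K}$,
and a target feature $t \in \mathcal{F}$, Algorithm 1 returns a set
$P \subseteq \mathcal{F}$ iff $P$ is a weak AXp for $(\mathbf{v}, c)$, with the
property that any AXp $\mathcal{X} \subseteq P$ is such that
$t \in \mathcal{X}$ (i.e. $P$ is a witness for the relevancy of $t$).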


5 Experimental Results 


This section reports the experimental results on FRP for the d-DNNF and mono- 
tonic classifiers. The goal is to show that FRP is practically feasible. We opt not 
to include experiments for FNP as the complexity of FNP is in P. Besides, to the
best of our knowledge, there is no baseline to compare with. The experiments
were performed on a MacBook Pro with a 6-Core Intel Core i7 2.6 GHz processor 
with 16 GByte RAM, running macOS Monterey. 


d-DNNF classifiers. For d-DNNFs, we pick SDDs, one of their subsets, as our
target classifier. SDDs support polynomial time negation, so given an SDD
$\mathcal{C}$, one can obtain its negation $\neg\mathcal{C}$ efficiently.


180 X. Huang et al. 


Table 4: Solving FRP for SDDs. Sub-columns Avg. #var and Avg. #cls show,
respectively, the average number of variables and clauses in a CNF encoding.
Column Runtime reports maximum and average time in seconds for deciding FRP.

             SDD                                  CNF                       Runtime (s)
Dataset      #Features   #Nodes   (illegible)    Avg. #var   Avg. #cls     Max      Avg.
Accidents    415         8863     97             26513       78276         56.4     3.5
Audio        272         7224     88             31148       100972        663.1    22.0
DNA          513         8570     91             29155       91288         86.3     11.0
Jester       254         7857     85             35998       121508        362.1    22.7
KDD          306         8109     99             26402       83480         111.2    2.8
Mushrooms    248         7096     91             23874       82112         266.3    15.8
Netflix      292         7039     94             25520       83324         105.7    4.2
NLTCS        183         6661     100            19817       58494         1.4      0.5
Plants       244         6724     97             25356       84782         950.7    20.6
RCV-1        410         9472     90             33438       102500        153.6    11.2
Retail       341         3704     87             10601       28342         1.8      1.1

Monotonic classifiers. For monotonic classifiers, we consider the Deep Lattice
Network (DLN) [70] as our target classifier. Since our approach for monotonic
classifiers is model-agnostic, it could also be used with other approaches for
learning monotonic classifiers [48, 69], including Min-Max Network [21,64] and
COMET [65].


Prototype implementation. The proposed approaches were prototyped in
Python⁹. The PySAT toolkit¹⁰ was used for propositional encodings. Besides,
PySAT invokes the Glucose 4¹¹ SAT solver to pick a weak AXp candidate. SDDs
were loaded using the PySDD¹² package.


Benchmarks & training. For SDDs, we selected 11 datasets from the Density
Estimation Benchmark Datasets¹³ [34,46,49]. These 11 datasets were used to learn
SDDs using LearnSDD [11] (with parameter maxEdges=20000). The obtained
SDDs were used as binary classifiers. For DLNs, we selected 5 publicly available
datasets: australian (aus), breast_cancer (b.c.), heart_c, nursery [57] and
pima [2]. We used the three-layer DLN architecture: Calibrators → Random
Ensemble of Lattices → Linear Layer. All calibrators for all models used a fixed
number of 20 keypoints, and the size of all lattices was set to 3.


Results for SDDs. For each SDD, 100 test instances were randomly generated.
All tested instances have prediction 0. (We did not pick instances predicted
as class 1, as this requires the compilation of a new classifier, which may have
a different size.) Besides, for each instance, we randomly picked a feature
appearing in the model. Hence for each SDD, we solved 100 queries. Table 4
summarizes the results. The number of nodes of the tested SDDs ranges from 3704
to 9472, and the number of features ranges from 183 to 513. Besides, the
percentage of examples for which the answer is Y (i.e. the target feature is in
some AXp) ranges from 85% to 100%. Regarding the runtime, the largest running
time for solving one query can exceed 15 minutes. However, the average running
time to solve a query is less than 25 seconds, which highlights the scalability
of the proposed encoding.


⁹ https://github.com/XuanxiangHuang/frp-experiment
¹⁰ https://github.com/pysathq/pysat
¹¹ https://www.labri.fr/perso/lsimon/glucose/
¹² https://github.com/wannesm/PySDD
¹³ https://github.com/UCLA-StarAI/Density-Estimation-Datasets


Table 5: Solving FRP for DLN. Column Runtime reports maximum and average time in
seconds for deciding FRP. Column SAT Time (resp. κ(v) Time) reports maximum and
average time in seconds for the SAT solver (resp. calling the DLN's predict
function) to decide FRP. Column SAT Calls (resp. κ(v) Calls) reports maximum and
average number of calls to the SAT solver (resp. to the DLN's predict function)
to decide FRP.

                Runtime (s)   SAT Time (s)   SAT Calls   κ(v) Time (s)   κ(v) Calls   κ(v) Time /
Dataset   %Y    Max    Avg.   Max    Avg.    Max   Avg.   Max    Avg.    Max   Avg.   Runtime
aus       61    40.4   8.31   0.02   0.01    291    65    40.0   8.15    424    98    97.8%
b.c.      45     5.4   1.93   0.00   0.00     53    20     5.3   1.89     78    30    98.0%
heart_c   35    31.5   6.67   0.02   0.00    171    54    31.1   6.52    249    80    97.7%
nursery   45     4.3   1.77   0.00   0.00     31    13     4.3   1.75     73    30    98.6%
pima      74     3.7   1.41   0.00   0.00     33    13     3.7   1.39     47    22    98.4%


Results for DLNs. For each DLN, we randomly picked 200 test instances,
and for each test instance, we randomly picked a feature. Hence for each DLN, we
solved 200 queries. Table 5 summarizes the results. The use of a SAT solver has a 
negligible contribution to the running time. Indeed, for all the examples shown, 
at least 97% of the running time is spent running the classifier. This should be 
unsurprising, since the number of iterations of Algorithm 1 never exceeds a
few hundred. (The fraction of a second reported in some cases should be divided 
by the number of calls to the SAT solver; hence the time spent in each call to the 
SAT solver is indeed negligible.) As can be observed, the percentage of examples 
for which the answer is Y (i.e. target feature is in some AXp and the algorithm 
returns true) ranges from 35% to 74%. There is no apparent correlation between 
the percentage of Y answers and the number of iterations. The large number of 
queries accounts for the number of times the DLN is queried by Algorithm 1, 
but it also accounts for the number of times the DLN is queried for extracting 
an AXp from set P (i.e. the witness) when the algorithm’s answer is true. A 
loose upper bound on the number of queries to the classifier is 4 × NS + 2 × |F|,
where NS is the number of SAT calls, and |F| is the number of features. Each 
iteration of Algorithm 1 can require at most 4 queries to the classifier. After 
reporting P, at most 2 queries per feature will be required to extract the AXp 
(see Section 2.3). As can be observed this loose upper bound is respected by the 
reported results. 
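
As a quick sanity check of this bound, the sketch below recomputes it from the maxima reported in Table 5. The helper name and the dictionary are ours (hypothetical), and the second table row is taken to be breast_cancer (b.c.). Since |F| is not listed in the table, it is conservatively set to 0, which only makes the check stricter. Because the bound holds for every query, it also holds when both sides are replaced by their per-dataset maxima.

```python
def max_classifier_queries(num_sat_calls, num_features=0):
    """Loose upper bound 4 x NS + 2 x |F| on the number of classifier queries."""
    return 4 * num_sat_calls + 2 * num_features


# (max SAT calls, max classifier calls) per dataset, copied from Table 5.
# Assumption: the second row of the table is breast_cancer (b.c.).
TABLE5_MAX_CALLS = {
    "aus": (291, 424),
    "breast_cancer": (53, 78),
    "heart_c": (171, 249),
    "nursery": (31, 73),
    "pima": (33, 47),
}

for dataset, (sat_calls, classifier_calls) in TABLE5_MAX_CALLS.items():
    bound = max_classifier_queries(sat_calls)  # |F| omitted, i.e. set to 0
    print(f"{dataset}: {classifier_calls} <= {bound} is {classifier_calls <= bound}")
```

Even without the 2 × |F| term, every reported maximum stays below the bound, in line with the observation above.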
6 Related Work


The problems of necessity and relevancy have been studied in logic-based abduc- 
tion since the early 90s [25, 30,61]. However, this earlier work did not consider 
the classes of (classifier) functions that are considered in this paper. 

There has been recent work on explainability queries [7,8,36]. Some of these 
queries can be related with feature relevancy and necessity. For example, rel- 
evancy and necessity have been studied with respect to a target class [7, 8], 
in contrast with our approach that studies a concrete instance, and so can be 
naturally related with earlier work on abduction. Recent work [36] studied fea- 
ture relevancy under the name feature membership, but neither d-DNNF nor 
monotonic classifiers were discussed. Moreover, [36] only proved the hardness 
of deciding feature relevancy for DNF and DT classifiers and did not discuss 
the feature necessity problem. The results presented in this paper complement 
this work. Besides, the complexity results of FRP and FNP in this paper also 
complement the recent work [54] which summarizes the progress of formal expla- 
nations. [40] focused on the computation of one arbitrary AXp and one smallest 
AXp, which is orthogonal to our work. Computing one AXp does not guarantee 
that either FRP or FNP is decided, since the target feature t may not appear in 
the computed AXp. [53] studied the computation of one formal explanation and 
the enumeration of formal explanations in the case study of monotonic classifiers. 
However, neither FRP nor FNP was identified or studied.


7 Conclusions 


This paper studies the problems of feature necessity and relevancy in the context 
of formal explanations of ML classifiers. The paper proves several complexity re- 
sults, some related with necessity, but most related with relevancy. Furthermore, 
the paper proposes two different approaches for solving relevancy for two families 
of classifiers, namely classifiers represented with the d-DNNF propositional lan- 
guage, and monotonic classifiers. The experimental results confirm the practical 
scalability of the proposed algorithms. Future work will seek to prove hardness 
results for the families of classifiers for which hardness is yet unknown. 
Abstract. With the rapid growth of machine learning, deep neural networks
(DNNs) are now being used in numerous domains. Unfortunately,
DNNs are “black-boxes”, and cannot be interpreted by humans, which is 
a substantial concern in safety-critical systems. To mitigate this issue, re- 
searchers have begun working on explainable AI (XAI) methods, which 
can identify a subset of input features that are the cause of a DNN’s 
decision for a given input. Most existing techniques are heuristic, and 
cannot guarantee the correctness of the explanation provided. In con- 
trast, recent and exciting attempts have shown that formal methods can 
be used to generate provably correct explanations. Although these meth- 
ods are sound, the computational complexity of the underlying verifica- 
tion problem limits their scalability; and the explanations they produce 
might sometimes be overly complex. Here, we propose a novel approach 
to tackle these limitations. We (i) suggest an efficient, verification-based 
method for finding minimal explanations, which constitute a provable 
approximation of the global, minimum explanation; (ii) show how DNN 
verification can assist in calculating lower and upper bounds on the op- 
timal explanation; (iii) propose heuristics that significantly improve the 
scalability of the verification process; and (iv) suggest the use of bundles, 
which allows us to arrive at more succinct and interpretable explanations. 
Our evaluation shows that our approach significantly outperforms state- 
of-the-art techniques, and produces explanations that are more useful to 
humans. We thus regard this work as a step toward leveraging verification 
technology in producing DNNs that are more reliable and comprehensi- 
ble. 


1 Introduction 


Machine learning (ML) is a rapidly growing field with a wide range of applica- 
tions, including safety-critical, high-risk systems in the fields of health care [18], 
aviation [38] and autonomous driving [12]. Despite their success, ML models, 
and especially deep neural networks (DNNs), remain “black-boxes” — they are 
incomprehensible to humans and are prone to unexpected behaviour and errors. 
This issue can result in major catastrophes [13,73], and also in poor decision- 
making due to brittleness or bias [7, 24]. 

In order to render DNNs more comprehensible to humans, researchers have 
been working on explainable AI (XAI), where we seek to construct models for 
explaining and interpreting the decisions of DNNs [50,55-57]. Work to date has
focused on heuristic approaches, which provide explanations, but do not provide 
guarantees about the correctness or succinctness of these explanations [14,32,44]. 
Although these approaches are an important step, their limitations might result 
in skewed results, possibly failing to meet the regulatory guidelines of institu- 
tions and organizations such as the European Union, the US government, and 
the OECD [51]. Thus, producing DNN explanations that are provably accurate 
remains of utmost importance. 

More recently, the formal verification community has proposed approaches 
for providing formal and rigorous explanations for DNN decision making [27,31, 
51,59]. Many of these approaches rely on the recent and rapid developments in 
DNN verification [1,8, 9,39]. These approaches typically produce an abductive 
explanation (also known as a prime implicant, or PI-explanation) [31,58, 59]:
a minimum subset of input features, which by themselves already determine 
the classification produced by the DNN, regardless of any other input features. 
These explanations afford formal guarantees, and can be computed via DNN 
verification [31]. 

Abductive explanations are highly useful, but there are two major difficulties 
in computing them. First, there is the issue of scalability: computing locally 
minimal explanations might require a polynomial number of costly invocations 
of the underlying DNN verifier, and computing a globally minimal explanation 
is even more challenging [10, 31, 48]. The second difficulty is that users may 
sometimes prefer “high-level” explanations, not based solely on input features, 
as these may be easier to grasp and interpret compared to “low-level”, complex, 
feature-based explanations. 

To tackle the first difficulty, we propose here new approaches for more effi- 
ciently producing verification-based abductive explanations. More concretely, we 
propose a method for provably approximating minimum explanations, allowing 
stakeholders to use slightly larger explanations that can be discovered much more 
quickly. To accomplish this, we leverage the recently discovered dual relationship 
between explanations and contrastive examples [30]; and also take advantage of 
the sensitivity of DNNs to small adversarial perturbations [64], to compute both 
lower and upper bounds for the minimum explanation. In addition, we propose 
novel heuristics for significantly expediting the underlying verification process. 

In addressing the second difficulty, i.e. the interpretability limitations of “low- 
level” explanations, we propose to construct explanations in terms of bundles, 
which are sets of related features. We empirically show that using our method 
to produce bundle explanations can significantly improve the interpretability of 
the results, and even the scalability of the approach, while still maintaining the 
soundness of the resulting explanations. 

To summarize, our contributions include the following: (i) We are the first 
to suggest a method that formally produces sound and minimal abductive ex- 
planations that provably approximate the global-minimum explanation. (ii) Our 
three suggested novel heuristics expedite the search for minimal abductive ex- 
planations, significantly outperforming the state of the art. (iii) We suggest a 
novel approach for using bundles to efficiently produce sound and provable
explanations that are more interpretable and succinct.

For evaluation purposes, we implemented our approach as a proof-of-concept 
tool. Although our method can be applied to any ML model, we focused here 
on DNNs, where the verification process is known to be NP-complete [39], and 
the scalable generation of explanations is known to be challenging [31,58]. We 
used our tool to test the approach on DNNs trained for digit and clothing classi- 
fication, and also compared it to state-of-the-art approaches [31,32]. Our results 
indicate that our approach was successful in quickly producing meaningful ex- 
planations, often running 40% faster than existing tools. We believe that these 
promising results showcase the potential of this line of work. 

The rest of the paper is organized as follows. Sec. 2 contains background on 
DNNs and their verification, as well as on formal, minimal explanations. Sec. 3 
covers the main method for calculating approximations of minimum explana- 
tions, and Sec. 4 covers methods for improving the efficiency of calculating these 
approximations. Sec. 5 covers the use of bundles in constructing “high-level”, 
provable explanations. Next, we present our evaluation in Sec. 6. Related work 
is covered in Sec. 7, and we conclude in Sec. 8. 


2 Background 


DNNs. A deep neural network (DNN) [46] is a directed graph composed of 
layers of nodes, commonly called neurons. In feed-forward NNs the data flows 
from the first (input) layer, through intermediate (hidden) layers, and onto an 
output layer. A DNN’s output is calculated by assigning values to its input 
neurons, and then iteratively calculating the values of neurons in subsequent 
layers. In the case of classification, which is the focus of this paper, each output 
neuron corresponds to a specific class, and the output neuron with the highest 
value corresponds to the class the input is classified to. 
Fig. 1 depicts a simple, feed-forward DNN. The input layer includes three
neurons, followed by a weighted sum layer, which calculates an affine
transformation of values from the input layer. Given the input V₁ = [1,1,1]ᵀ, the
second layer computes the values V₂ = [6,9,11]ᵀ. Next comes a ReLU layer, which
computes the function ReLU(x) = max(0,x) for each neuron in the preceding layer,
resulting in V₃ = [6,9,11]ᵀ. The final (output) layer then computes an affine
transformation, resulting in V₄ = [15,−4]ᵀ. This indicates that input
V₁ = [1,1,1]ᵀ is classified as the category corresponding to the first output
neuron, which is assigned the greater value.

Fig. 1: A simple DNN.
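
The layer-by-layer evaluation described above can be sketched in a few lines of Python. The weights below are hypothetical: they are not the weights of Fig. 1, and were chosen only so that the intermediate vectors match the values stated in the text for the input [1,1,1]ᵀ.

```python
def affine(weights, bias, values):
    """Weighted-sum layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    """ReLU layer: max(0, x) applied to every neuron."""
    return [max(0, v) for v in values]


# Hypothetical weights (NOT taken from Fig. 1); biases are zero for simplicity.
W1 = [[1, 2, 3], [2, 3, 4], [3, 4, 4]]
B1 = [0, 0, 0]
W2 = [[1, 1, 0], [-1, -1, 1]]
B2 = [0, 0]


def forward(x):
    """Return the values of the weighted-sum, ReLU and output layers."""
    v2 = affine(W1, B1, x)
    v3 = relu(v2)
    v4 = affine(W2, B2, v3)
    return v2, v3, v4


def classify(x):
    """Index of the output neuron with the highest value."""
    v4 = forward(x)[2]
    return max(range(len(v4)), key=lambda i: v4[i])


print(forward([1, 1, 1]), classify([1, 1, 1]))
```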


DNN Verification. A DNN verification query is a tuple ⟨P, N, Q⟩, where N is a
DNN that maps an input vector x to an output vector y = N(x), P is a predicate
on x, and Q is a predicate on y. A DNN verifier needs to decide whether there
exists an input x₀ that satisfies P(x₀) ∧ Q(N(x₀)) (the SAT case) or not (the
UNSAT case). Typically, P and Q are expressed in the logic of real arithmetic [49]. 
The DNN verification problem is known to be NP-Complete [39]. 


Formal Explanations. We focus here on explanations for classification problems,
where a model is trained to predict a label for each given input. A
classification problem is a tuple ⟨F, D, K, N⟩ where (i) F = {1, ..., m} denotes the
features; (ii) D = {D₁, D₂, ..., Dₘ} denotes the domains of each of the features,
i.e. the possible values that each feature can take. The entire feature (input)
space is hence 𝔽 = D₁ × D₂ × ... × Dₘ; (iii) K = {c₁, c₂, ..., cₙ} is a set of
classes, i.e. the possible labels; and (iv) N : 𝔽 → K is a (non-constant)
classification function (in our case, a neural network). A classification instance
is the pair (v, c), where v ∈ 𝔽, c ∈ K, and c = N(v). In other words, v is mapped
by the neural network N to class c.

Looking at (v, c), we often wish to know why v was classified as c. Informally,
an explanation is a subset of features E ⊆ F, such that assigning these features
to the values assigned to them in v already determines that the input will be
classified as c, regardless of the remaining features F \ E. In other words, even
if the values that are not in the explanation are changed arbitrarily, the
classification remains the same. More formally, given input v = (v₁, ..., vₘ) ∈ 𝔽
with the classification N(v) = c, an explanation (sometimes referred to as an
abductive explanation, or an AXP) is a subset of the features E ⊆ F, such that:

$$
\forall (x \in \mathbb{F}).\ \Big[ \bigwedge_{i \in E} (x_i = v_i) \to (N(x) = c) \Big] \tag{1}
$$

We continue with the running example from Fig. 1. For simplicity, we assume
that each input neuron can only be assigned the values 0 or 1. It can be observed
that for input V₁ = [1,1,1]ᵀ, the set {v₁¹, v₁²} is an explanation; indeed, once
the first two entries in V₁ are set to 1, the classification remains the same for
any value of the third entry (see Fig. 2). We can prove this by encoding a
verification query ⟨P, N, Q⟩ = ⟨E = v, N, Q_¬c⟩, where E is the candidate
explanation, E = v means that we restrict the features in E to their values in v,
and Q_¬c implies that the classification is not c. An UNSAT result for this query
indicates that E is an explanation for instance (v, c).
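
On a toy input space, this verification query can be emulated by exhaustive enumeration. The sketch below does so for three binary features. Note that `classify` is a hypothetical Boolean stand-in, chosen only to agree with the explanations stated for the running example; it is not the network of Fig. 1, and a realistic pipeline would dispatch the query to a DNN verifier instead.

```python
from itertools import product

FEATURES = (0, 1, 2)   # indices of the three binary input features
V = (1, 1, 1)          # the instance being explained


def classify(x):
    """Hypothetical stand-in for N: class 0 iff x[2] = 1 or x[0] = x[1] = 1."""
    return 0 if (x[2] == 1 or (x[0] == 1 and x[1] == 1)) else 1


def counterexample(fixed, v=V):
    """Emulate the query <fixed = v, N, Q_not_c>.

    Returns an input that agrees with v on `fixed` but is classified
    differently (the SAT case), or None (the UNSAT case).
    """
    c = classify(v)
    free = [i for i in FEATURES if i not in fixed]
    for values in product((0, 1), repeat=len(free)):
        x = list(v)
        for i, val in zip(free, values):
            x[i] = val
        if classify(x) != c:
            return tuple(x)
    return None


def is_explanation(E):
    """E is an explanation iff the query is UNSAT."""
    return counterexample(set(E)) is None


print(is_explanation({0, 1}), is_explanation({0}), counterexample({0}))
```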

Clearly, the set of all features constitutes a trivial explanation. However, 
we are interested in smaller explanation subsets, which can provide useful in- 
formation regarding the decision of the classifier. More precisely, we search for 
minimal explanations and minimum explanations. A subset E ⊆ F is a minimal
explanation (also referred to as a local-minimal explanation, or a subset-minimal
explanation) of instance (v, c) if it is an explanation that ceases to be an
explanation if even a single feature is removed from it:

$$
\Big( \forall (x \in \mathbb{F}).\ \Big[ \bigwedge_{i \in E} (x_i = v_i) \to (N(x) = c) \Big] \Big) \wedge
\Big( \forall (j \in E).\ \Big[ \exists (y \in \mathbb{F}).\ \Big[ \bigwedge_{i \in E \setminus \{j\}} (y_i = v_i) \wedge (N(y) \neq c) \Big] \Big] \Big) \tag{2}
$$

Fig. 3 demonstrates that {v₁¹, v₁²} is a minimal explanation in our running
example: removing any of its features allows misclassification.

Fig. 3: {v₁¹, v₁²} is a minimal explanation for input V₁ = [1,1,1]ᵀ.


A minimum explanation (sometimes referred to as a cardinal minimal
explanation or a PI-explanation) is defined as a minimal explanation of minimum
size; i.e., if E is a minimum explanation, then there does not exist a minimal
explanation E′ ≠ E such that |E′| < |E|. Fig. 4 demonstrates that {v₁³} is a
minimum explanation for our running example.


Fig. 4: {v₁³} is a minimum explanation for input V₁ = [1,1,1]ᵀ.
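
The difference between minimal and minimum explanations can be illustrated by enumerating all feature subsets of a toy instance. In this sketch, `classify` is again a hypothetical Boolean stand-in (not the network of Fig. 1), picked so that the resulting sets match those stated for the running example; features are indexed from 0.

```python
from itertools import combinations, product

FEATURES = (0, 1, 2)
V = (1, 1, 1)


def classify(x):
    """Hypothetical stand-in for N: class 0 iff x[2] = 1 or x[0] = x[1] = 1."""
    return 0 if (x[2] == 1 or (x[0] == 1 and x[1] == 1)) else 1


def is_explanation(E, v=V):
    """Every input that agrees with v on E keeps the class of v."""
    c = classify(v)
    return all(classify(x) == c
               for x in product((0, 1), repeat=len(v))
               if all(x[i] == v[i] for i in E))


def is_minimal(E):
    """Subset-minimal: an explanation that breaks if any one feature is dropped."""
    return is_explanation(E) and not any(is_explanation(E - {j}) for j in E)


def all_subsets():
    for k in range(len(FEATURES) + 1):
        for subset in combinations(FEATURES, k):
            yield frozenset(subset)


minimal = [E for E in all_subsets() if is_minimal(E)]
minimum = min(minimal, key=len)   # a minimal explanation of smallest size
print(minimal, minimum)
```

Checking single-feature removals suffices here because any superset of an explanation is itself an explanation.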


Contrastive Example. A subset of features C ⊆ F is called a contrastive example
or a contrastive explanation (CXP) if altering the features in C is sufficient
to cause the misclassification of a given classification instance (v, c):

$$
\exists (x \in \mathbb{F}).\ \Big[ \bigwedge_{i \in F \setminus C} (x_i = v_i) \wedge (N(x) \neq c) \Big] \tag{3}
$$

Fig. 5: {v₁², v₁³} is a contrastive example for V₁ = [1,1,1]ᵀ.

A contrastive example for our running example is shown in Fig. 5. Notice that
the question of whether a set is a contrastive example can be encoded into a
verification query ⟨P, N, Q⟩ = ⟨(F \ C) = v, N, Q_¬c⟩, where a SAT result
indicates that C is a contrastive example. As with explanations, smaller
contrastive examples are more valuable than large ones. One useful notion is that
of a contrastive singleton: a contrastive example of size one. A contrastive
singleton could represent a specific pixel in an image, the alteration of which
could result in misclassification. Such singletons are leveraged in “one-pixel
attacks” [64] (see Fig. 16 in the appendix of the full version of this
paper [11]). Contrastive singletons have the following important property:


Lemma 1. Every contrastive singleton is contained in all explanations. 


The proof appears in Sec. A of the appendix of the full version of this pa- 
per [11]. Lemma 1 implies that each contrastive singleton is contained in all 
minimal/minimum explanations. 

We consider also the notion of a contrastive pair, which is a contrastive ex- 
ample of size 2. Clearly, for any pair of features (u,v) where u or v are con- 
trastive singletons, (u,v) is a contrastive pair; however, when we next refer to 
contrastive pairs, we consider only pairs that do not contain any contrastive 
singletons. Likewise, for every k > 2, we can consider contrastive examples of
size k, excluding those that contain a contrastive example of size 1, ..., k−1
as a subset.

We state the following lemma, whose proof also appears in Sec. A of the
appendix of the full version of this paper [11]: 


Lemma 2. All explanations contain at least one element of every contrastive 
pair. 


The lemma can be generalized to any k > 2; and can be used in showing that
the minimum hitting set (MHS) of all contrastive examples is exactly the
minimum explanation [29,54] (see Sec. B of the appendix of the full version of
this paper [11]). Further, the lemma implies a duality between contrastive
examples and explanations [30,34]: a minimal hitting set of all contrastive examples
constitutes a minimal explanation, and a minimal hitting set of all explanations 
constitutes a minimal contrastive example. 
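
The duality can be checked by brute force on a toy instance. The sketch below enumerates the minimal contrastive examples and the minimal explanations of a hypothetical Boolean stand-in classifier (not the network of Fig. 1), and confirms that the minimal hitting sets of the former coincide with the latter. All function names are ours.

```python
from itertools import combinations, product

FEATURES = (0, 1, 2)
V = (1, 1, 1)


def classify(x):
    """Hypothetical stand-in for N: class 0 iff x[2] = 1 or x[0] = x[1] = 1."""
    return 0 if (x[2] == 1 or (x[0] == 1 and x[1] == 1)) else 1


def inputs_agreeing_on(fixed, v=V):
    return (x for x in product((0, 1), repeat=len(v))
            if all(x[i] == v[i] for i in fixed))


def is_explanation(E):
    return all(classify(x) == classify(V) for x in inputs_agreeing_on(E))


def is_contrastive(C):
    fixed = [i for i in FEATURES if i not in C]
    return any(classify(x) != classify(V) for x in inputs_agreeing_on(fixed))


def minimal_sets(predicate):
    """Subset-minimal feature sets satisfying an upward-closed predicate."""
    found = []
    for k in range(len(FEATURES) + 1):          # increasing size
        for subset in map(frozenset, combinations(FEATURES, k)):
            if predicate(subset) and not any(m <= subset for m in found):
                found.append(subset)
    return found


cxps = minimal_sets(is_contrastive)
axps = minimal_sets(is_explanation)
hitting = minimal_sets(lambda h: all(h & c for c in cxps))
print(cxps, axps, hitting)
```

On this instance there are no contrastive singletons, the minimal contrastive examples are two pairs sharing one feature, and that shared feature alone forms the minimum explanation.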


3 Provable Approximations for Minimal Explanations 


State-of-the-art approaches for finding minimum explanations exploit the MHS
duality between explanations and contrastive examples [31]. The idea is to
iteratively compute contrastive examples, and then use their MHS as an
under-approximation for the minimum explanation. Finding this MHS is an
NP-complete problem, and is difficult in practice as the number of contrastive
examples increases [20]. Although the MHS can be approximated using maximum
satisfiability (MaxSAT) or mixed integer linear programming (MILP)
solvers [26,47], existing approaches tackle simpler ML models, such as decision
trees [33,36], and face scalability limitations when applied to DNNs [31,58].
Further, enumerating all contrastive examples may in itself take exponential time.
Finally, recall that DNN verification is an NP-complete problem [39]; and so
dispatching a verification query to identify each explanation or contrastive
example is also very slow when the feature space is large. Finding minimal
explanations may be easier [31], but may converge to larger and less meaningful
explanations, while still requiring a linear number of calls to the underlying
verifier. Our approach, described next, seeks to mitigate these difficulties.

Our overall approach is described in Algorithm 1. It comprises two separate
threads, intended to be run in parallel. The upper bounding thread (T_UB) is
responsible for computing a minimal explanation. It starts with the entire feature
space, and then gradually reduces it, until converging to a minimal explanation.
The size of the presently smallest explanation is regarded as an upper bound
(UB) for the size of the minimum explanation. Symmetrically, the lower bounding
thread (T_LB) attempts to construct small contrastive sets, used for computing a
lower bound (LB) on the size of the minimum explanation. Together, these two
bounds allow us to compute the approximation ratio between the minimal
explanation that we have discovered and the minimum explanation. For instance,
given a minimal explanation of size 7 and a lower bound of size 5, we can deduce
that our explanation is at most UB/LB = 7/5 times larger than the minimum. The
two threads share global variables that indicate the set of contrastive singletons
(Singletons), the set of contrastive pairs (Pairs), the upper and lower bounds
(UB, LB), and the set of features that were determined not to participate in
the explanation and are “free” to be set to any value (Free). The output of our
algorithm is a minimal explanation (F \ Free), and the approximation ratio
(UB/LB). We next discuss each of the two threads in detail.


Algorithm 1 Minimal Explanation Search


Input N (neural network), F (features), v (input values), c (class prediction)
1: Singletons, Pairs, Free ← ∅, UB ← |F|, LB ← 0    ▷ Global variables
2: Launch thread T_UB
3: Launch thread T_LB
4: return F \ Free, UB/LB


The Upper Bounding Thread (T_UB). This thread, whose pseudocode appears
in Algorithm 2, follows the framework proposed by Ignatiev et al. [31]: it
seeks a minimal explanation by starting with the entire feature space, and then
iteratively attempting to remove individual features. If removing a feature allows
misclassification, we keep it as part of the explanation; otherwise, we remove it
and continue. This process issues a single verification query for each feature,
until converging to a minimal explanation (lines 2-8). Although this naive search
is guaranteed to converge to a minimal explanation, it need not converge to
a minimum explanation; and so we apply a more sophisticated ordering scheme,
similar to the one proposed by [32], where we use some heuristic model to
assign importance weights to the input features. We then check
the “least important” input features first, since freeing them has a lower chance
of causing a misclassification, and they are consequently more likely to be
successfully removed. We then continue iterating over features in ascending order
of importance, hopefully producing small explanations.


Algorithm 2 T_UB: Upper Bounding Thread


1: Use a heuristic model to sort F's features by ascending relevance
2: for each f ∈ F do
3:     Explanation ← F \ Free
4:     if Verify(⟨(Explanation \ {f}) = v, N, Q_¬c⟩) is UNSAT then
5:         Free ← Free ∪ {f}
6:         UB ← UB − 1
7:     end if
8: end for

The Lower Bounding Thread (T_LB). The pseudocode for the lower bounding
thread (T_LB) appears in Algorithm 3. In lines 1-6, the thread searches for
contrastive singletons. Neural networks were shown to be very sensitive to ad- 
versarial attacks [25] — slight input perturbations that cause misclassification 
(e.g., the aforementioned one-pixel attack [64]) — and this suggests that con- 
trastive sets, and in particular contrastive singletons, exist in many cases. We 
observe that identifying contrastive singletons is computationally cheap: by en- 
coding Eq. 3 as a verification query, once for each feature, we can discover all 
singletons; and in these queries all features but one are fixed, which empirically 
allows verifiers to dispatch them quickly. 

The rest of T_LB (lines 9-13) performs a similar process, but with contrastive
pairs (which do not contain contrastive singletons as one of their features). We
use verification queries to identify all such pairs, and then attempt to find their
MHS. We observe that finding the MHS of all contrastive pairs is the 2-MHS
problem, which is a reformulation of the minimum vertex cover problem (see
Sec. B of the appendix of the full version of this paper [11]). Since this is an
easier problem than the general MHS problem, solving it with MaxSAT or
MILP often converges quickly. In addition, the minimum vertex cover problem
admits a linear-time greedy 2-approximation algorithm, which can be used for
finding a lower bound in cases of large feature spaces.

Algorithm 3 T_LB: Lower Bounding Thread


1: for each f ∈ F do    ▷ Find all singletons
2:     if Verify(⟨(F \ {f}) = v, N, Q_¬c⟩) is SAT then
3:         Singletons ← Singletons ∪ {f}
4:         LB ← LB + 1
5:     end if
6: end for
7:
8: AllPairs ← Distinct pairs of F \ Singletons
9: for each (a, b) ∈ AllPairs do    ▷ Find all pairs
10:     if Verify(⟨(F \ {a, b}) = v, N, Q_¬c⟩) is SAT then
11:         Pairs ← Pairs ∪ {(a, b)}
12:     end if
13: end for
14: LB ← LB + MVC(Pairs)


More formally, T_LB performs an efficient computation of the following bound:

$$
\mathrm{LB} = |\mathrm{Singletons}| + |\mathrm{MVC}(\mathrm{Pairs})| \le \mathrm{MHS}(\mathrm{Cxps}) = E_M \tag{4}
$$

where MVC is the minimum vertex cover, Cxps denotes the set of all contrastive
examples, and E_M is the size of the minimum explanation.

It is worth mentioning that this approach can be extended to use contrastive 
examples of larger sizes (k = 3,4,...), as specified in Sec. C of the appendix of 
the full version of this paper. The fact that small contrastive examples, such 
as singletons, exist in large, state-of-the-art DNNs with large inputs [21, 64] 
suggests that useful approximations exist in large DNNs. In our experiments, 
we observed that using only singletons and pairs affords good approximations, 
without incurring overly expensive computations by the underlying verifier. 
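
The lower bound of Eq. 4 can be sketched as follows, under stated assumptions: the verifier is replaced by exhaustive enumeration over three binary features, `classify` is a hypothetical Boolean stand-in rather than a DNN, and the minimum vertex cover is found by brute force (feasible only for tiny inputs). The greedy matching helper is our addition; it relies on the standard fact that the size of any maximal matching never exceeds the size of a minimum vertex cover, so it gives a cheaper but possibly weaker bound.

```python
from itertools import combinations, product

FEATURES = (0, 1, 2)
V = (1, 1, 1)


def classify(x):
    """Hypothetical stand-in for N: class 0 iff x[2] = 1 or x[0] = x[1] = 1."""
    return 0 if (x[2] == 1 or (x[0] == 1 and x[1] == 1)) else 1


def is_contrastive(C, v=V):
    """SAT check: some input agreeing with v outside C is classified differently."""
    c = classify(v)
    return any(classify(x) != c
               for x in product((0, 1), repeat=len(v))
               if all(x[i] == v[i] for i in FEATURES if i not in C))


def min_vertex_cover_size(pairs):
    """Exact minimum vertex cover size by exhaustive search (small inputs only)."""
    nodes = sorted({n for pair in pairs for n in pair})
    for k in range(len(nodes) + 1):
        for cover in combinations(nodes, k):
            if all(a in cover or b in cover for a, b in pairs):
                return k


def matching_lower_bound(pairs):
    """Size of a greedy maximal matching; at most the minimum vertex cover size."""
    used, size = set(), 0
    for a, b in pairs:
        if a not in used and b not in used:
            used.update((a, b))
            size += 1
    return size


def lower_bounding_thread(v=V):
    singletons = [f for f in FEATURES if is_contrastive({f}, v)]
    rest = [f for f in FEATURES if f not in singletons]
    pairs = [(a, b) for a, b in combinations(rest, 2) if is_contrastive({a, b}, v)]
    lb = len(singletons) + min_vertex_cover_size(pairs)
    return lb, singletons, pairs


print(lower_bounding_thread())
```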


4 Finding Minimal Explanations Efficiently 


Algorithm 1 is the backbone of our approach, but it suffers from limited
scalability, particularly in T_UB. As the execution of T_UB progresses, and as additional
features are “freed”, the quickly growing search space slows down the underlying 
verifier. Here we propose three different methods for expediting this process, by 
reducing the number of verification queries required. 


Method 1: Using Information from T_LB. We suggest leveraging the contrastive
examples found by T_LB to expedite T_UB. The process is described in
Algorithm 4. In line 2, T_LB is queried for the current set of contrastive
singletons, which we know must be part of any minimal explanation. These are
subtracted from the RemainingFeatures set (features left for T_UB to query), and
consequently will not be added to the Free set; i.e., they are marked as part
of the current explanation. In addition, for any contrastive pair (a, b) found by
T_LB, either a or b must appear in any minimal explanation; and so, our algorithm
skips checking the case where both a and b are removed from F (line 8). (The
method could also be extended to contrastive sets of greater cardinality.)


Algorithm 4 T_UB using information from T_LB


1: Use a heuristic model to sort F by ascending relevance
2: RemainingFeatures ← F \ Singletons
3: for each f ∈ RemainingFeatures do
4:     Explanation ← F \ Free
5:     if Verify(⟨(Explanation \ {f}) = v, N, Q_¬c⟩) is UNSAT then
6:         Free ← Free ∪ {f}
7:         UB ← UB − 1
8:         Delete all features in a pair with f from RemainingFeatures
9:     end if
10: end for
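
The following sketch illustrates how the hints from the lower bounding thread save verification queries. As before, it rests on assumptions: the verifier is emulated by enumeration, `classify` is a hypothetical Boolean stand-in, and the singletons and pairs are passed in as plain arguments instead of being read from shared variables of a concurrently running thread.

```python
from itertools import product

FEATURES = (0, 1, 2)
V = (1, 1, 1)


def classify(x):
    """Hypothetical stand-in for N: class 0 iff x[2] = 1 or x[0] = x[1] = 1."""
    return 0 if (x[2] == 1 or (x[0] == 1 and x[1] == 1)) else 1


def verify_unsat(fixed, v=V):
    """Emulate Verify(<fixed = v, N, Q_not_c>); True means UNSAT."""
    c = classify(v)
    free = [i for i in FEATURES if i not in fixed]
    for values in product((0, 1), repeat=len(free)):
        x = list(v)
        for i, val in zip(free, values):
            x[i] = val
        if classify(x) != c:
            return False
    return True


def upper_bounding_with_hints(order, singletons, pairs, v=V):
    """Algorithm 4 sketch: skip features already known to be in the explanation."""
    free, queries = set(), 0
    skip = set(singletons)            # singletons belong to every explanation
    for f in order:
        if f in skip:
            continue
        explanation = set(FEATURES) - free
        queries += 1
        if verify_unsat(explanation - {f}, v):
            free.add(f)
            # The partner of a freed feature must stay in the explanation.
            for a, b in pairs:
                if f == a:
                    skip.add(b)
                elif f == b:
                    skip.add(a)
    return set(FEATURES) - free, queries


# Hints as they would be found by the lower bounding thread on this toy instance.
print(upper_bounding_with_hints([0, 1, 2], singletons=[], pairs=[(0, 2), (1, 2)]))
```

On this toy instance the explanation is found with two queries instead of the three that the plain sequential traversal would issue.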


Method 2: Binary Search. Sorting the features being considered in ascending 
order of importance can have a significant effect on the size of the explanation 
found by Algorithm 2. Intuitively, a “perfect” heuristic model would assign the 
greatest weights to all features in the minimum explanation, and so traversing 
features in ascending order would first discover all the features that can be 
removed (UNSAT verification queries), followed by all the features that belong in 
the explanation (SAT queries). In this case, a sequential traversal of the features 
in ascending order is quite wasteful, and it is much better to perform a binary 
search to find the point where the answer flips from UNSAT to SAT. 

Of course, in practice, the heuristic models are not perfect, leading to poten- 
tial cases with multiple “flips” from SAT to UNSAT, and vice versa. Still, if the 
heuristic is good in practice (which is often the case; see Sec. 6), these flips are 
scarce. Thus, we propose to perform multiple binary searches, each time identi- 
fying one SAT query (i.e., a feature added to the explanation). Observe that each 
time we hit an UNSAT query, this indicates that all the queries for features with 
lower priorities would also yield UNSAT — because if “freeing” multiple features 
cannot change the classification, changing fewer features certainly cannot. Thus, 
we are guaranteed to find the first SAT query in each iteration, and soundness 
is maintained. This process is described in Algorithm 6 and in Fig. 14 in the
appendix of the full version of this paper [11]. 
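
The repeated binary search can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 6: the oracle `is_sat` is a hypothetical stand-in for the verifier that takes the set of freed features and is assumed to be monotone (freeing more features can only turn UNSAT into SAT).

```python
from typing import Callable, FrozenSet, Iterable, List, Sequence, Set, Tuple


def make_toy_oracle(contrastive_sets: Iterable[Set[int]]) -> Callable[[FrozenSet[int]], bool]:
    """Hypothetical verifier: SAT iff every feature of some contrastive set is free."""
    sets = [frozenset(s) for s in contrastive_sets]
    return lambda free: any(s <= free for s in sets)


def binary_search_explanation(
    features: Sequence[int],
    is_sat: Callable[[FrozenSet[int]], bool],
) -> Tuple[List[int], int]:
    """Return (explanation, number of verifier queries)."""
    free: Set[int] = set()
    explanation: List[int] = []
    remaining = list(features)  # sorted by ascending relevance
    queries = 0
    while remaining:
        # Find the smallest k such that freeing remaining[:k + 1] is SAT.
        lo, hi = 0, len(remaining)  # hi == len(remaining) means "no SAT prefix"
        while lo < hi:
            mid = (lo + hi) // 2
            queries += 1
            if is_sat(frozenset(free | set(remaining[: mid + 1]))):
                hi = mid
            else:
                lo = mid + 1
        # Everything before the flip point is freed in one go.
        free.update(remaining[:lo])
        if lo == len(remaining):
            break
        # The feature at the flip point belongs to the explanation.
        explanation.append(remaining[lo])
        remaining = remaining[lo + 1:]
    return explanation, queries


oracle = make_toy_oracle([{5}, {6, 7}])
result = binary_search_explanation(list(range(8)), oracle)
```

On this toy instance the search needs five queries, where a sequential traversal of the eight features would need eight; the saving grows with the length of the UNSAT prefix.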


Method 3: Local-Singleton Search. Let N be a DNN, and let x be an input 
point whose classification we seek to explain. As part of Algorithm 2, $T_{UB}$ iteratively
“frees” certain input features, allowing them to take arbitrary values, as
it continues to search for features that must be included in the explanation. The 
increasing number of free features enlarges the search space that the underlying 
verifier must traverse, thus slowing down verification. We propose to leverage 
the hypothesis that input points nearby x that are misclassified tend to be clustered;
and so, it is beneficial to fix the free features to “bad” values, as opposed
to letting them take on arbitrary values. We speculate that this will allow the 
verifier to discover satisfying assignments much more quickly. 

This enhancement is shown in Algorithm 5. Given a set Free of features that 
were previously freed, we fix their values according to some satisfying assign- 
ment previously discovered. Thus, the verification of any new feature that we 
consider is similar to the case of searching for contrastive singletons, which, as
we already know, is fairly fast. See Fig. 15 in the appendix of the full version of 
this paper [11] for an illustration. The process can be improved further by fixing 
the freed features to small neighborhoods of the previously discovered satisfy- 
ing assignment (instead of its exact values), to allow some flexibility while still 
keeping the query’s search space small. 


Algorithm 5 $T_{UB}$ using local-singleton search

1: Use a heuristic model to sort F by ascending relevance
2: RemainingFeatures ← F \ Singletons
3: for each f ∈ RemainingFeatures do
4:     Explanation ← F \ Free
5:     if Verify((Explanation \ {f}) = v, N, Q_¬c) is UNSAT then
6:         Free ← Free ∪ {f}
7:         UB ← UB − 1
8:     else
9:         Extract counterexample C
10:        LocalSingletons ← ∅
11:        for each f′ ∈ RemainingFeatures do
12:            if Verify((Explanation \ {f′}) = C, N, Q_¬c) is SAT then
13:                LocalSingletons ← LocalSingletons ∪ {f′}
14:            end if
15:        end for
16:        RemainingFeatures ← RemainingFeatures \ LocalSingletons
17:    end if
18: end for


5 Minimal Bundle Explanations 


Fig. 6: Partition input’s features into bundles.

So far, we presented methods for generating explanations within a given
approximation ratio of the minimum explanation (Sec. 3), and for expediting the
computation of these explanations (Sec. 4), in order to improve the scalability of
our explanation generation mechanism. Next, we seek to tackle the second challenge
from Sec. 1, namely that these explanations may be too low-level for many users.
To address this challenge, we focus on bundles, which is a topic well covered in the
ML [63] and heuristic XAI literature [50,55] (commonly known as “super-pixels”
for computer-vision tasks). Intuitively, bundles are a partitioning of the features
into disjoint sets (an illustration appears in Fig. 6). The idea, which we later
validate empirically, is that providing explanations in terms of bundles is often
easier for humans to comprehend. As an added bonus, using bundles also curtails
the search space that the verifier must traverse, expediting the process even further.

Given a feature space $F = \{1,\ldots,m\}$, a bundle $b$ is just a subset $b \subseteq F$. When
dealing with the set of all bundles $B = \{b_1, b_2, \ldots, b_n\}$, we require that they form
a partitioning of $F$, namely $F = \biguplus_i b_i$. We define a bundle explanation $E_B$ for a
classification instance $(v,c)$ as a subset of bundles, $E_B \subseteq B$, such that:


$$\forall (x \in \mathbb{F}).\ \Big[ \bigwedge_{i \in \cup E_B} (x_i = v_i) \rightarrow (N(x) = c) \Big] \qquad (5)$$


The following theorem then connects bundle explanations and explicit, non- 
bundle explanations: 


Theorem 1. The union of features in a bundle explanation is an explanation. 


The proof directly follows from Eqs. 1 and 5. We note that this definition of 
bundles implies that features that are not part of the bundle explanation (i.e. 
features contained in “free” bundles) are “free” to be set to any possible value. 
Another possible alternative for defining bundles could be to allow features in 
“free” bundles to only change in the same, coordinated manner. We focus here 
on the former definition, and leave the alternative definition for future work. 
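
To make the bundle definitions concrete, the following minimal sketch checks that a family of bundles partitions the feature space and computes the feature-level explanation induced by a bundle explanation (the union used in Theorem 1). The helper names are hypothetical, and bundles are modelled simply as Python sets of feature indices.

```python
from typing import Iterable, List, Set


def is_partition(bundles: List[Set[int]], num_features: int) -> bool:
    """True iff the bundles are pairwise disjoint and cover {1, ..., num_features}."""
    seen: Set[int] = set()
    for bundle in bundles:
        if seen & bundle:  # overlap with an earlier bundle
            return False
        seen |= bundle
    return seen == set(range(1, num_features + 1))


def features_of_bundle_explanation(bundles: List[Set[int]], chosen: Iterable[int]) -> Set[int]:
    """Union of the features in the chosen bundles (indices into `bundles`)."""
    result: Set[int] = set()
    for index in chosen:
        result |= bundles[index]
    return result


bundles = [{1, 2}, {3}, {4, 5, 6}]
induced = features_of_bundle_explanation(bundles, [0, 2])
```

Here the bundle explanation consisting of the first and third bundles fixes features 1, 2, 4, 5 and 6, while feature 3 (the "free" bundle) may take any value.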

Many of the aforementioned results and definitions for explanations can be 
extended to bundle explanations. In a similar manner to Eq. 5, we can define the 
notions of minimal and minimum bundle explanations, a contrastive bundle sin- 
gleton, and contrastive bundle pairs (see Sec. D of the appendix of the full version 
of this paper [11]). Theorems 1 and 2 can be extended to bundle explanations in
a straightforward manner. It then follows that all bundle explanations contain 
all contrastive singleton bundles, and that all bundle explanations contain at 
least one bundle of any contrastive bundle pair. 

Our method from Secs. 3 and 4 can be similarly performed on bundles rather 
than on features, and $T_{UB}$ would then be used for calculating a minimal bundle
explanation, rather than a minimal explanation. Regarding the aforementioned 
approximation ratio, we discuss and evaluate two different methods for obtaining 
it. The first, natural approach is to apply our techniques from Sec. 3 on bundle 
explanations, thus obtaining a provable approximation for a minimum bundle 
explanation. The upper bound is trivially derived from the size of the bundle explanation
found by $T_{UB}$, whereas the lower bound calculation requires assigning
a cost to each bundle, representing the number of features it contains. This is 
done via a known notion of minimum hitting sets of bundles (MHSB) [6] and 
using minimum weighted vertex cover for the approximation of contrastive bun- 
dle pairs. This method, which is almost identical to the one mentioned in Sec. 3, 
is formalized in Sec. D of the appendix of the full version of this paper [11]. 

The second approach is to calculate an approximation ratio with respect to 
a regular, non-bundle minimum explanation. The minimal bundle explanation 
found by $T_{UB}$ is an upper bound on the minimum non-bundle explanation, following
Theorem 5. For computing a lower bound, we can analyze contrastive bundle
examples; extract from them contrastive non-bundle examples; and then use the 
duality property, compute an MHS of these contrastive examples, and derive 
lower bounds for the size of the minimum explanation. We formalize techniques 
for performing this calculation in Sec. E of the appendix of the full version of 
this paper [11]. 

6 Evaluation


Implementation and Setup. For evaluation purposes, we created a proof- 
of-concept implementation of our approach as a Python framework. Currently, 
the framework uses the Marabou verification engine [41] as a backend, although 
other engines may be used. Marabou is a Simplex-based DNN verification framework
that is sound and complete [5,39–41,68,69]; it includes support for
proof production [35], abstraction [15,16,52,60,67,72], and optimization [62],
and has been used in various settings, such as ensemble selection [3],
simplification [22,43], repair [23,53], and verification of reinforcement-learning based
systems [2,4,17]. For sorting features by their relevance, we used the popular XAI
method LIME [55]; although again, other heuristics could be used. The MVC
was calculated using the classic 2-approximating greedy algorithm. All experiments
reported were conducted on x86-64 GNU/Linux-based machines, using a
single Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz core, with a 1-hour timeout.
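
The text does not spell out the 2-approximating algorithm. A minimal sketch of one classic variant (greedy maximal matching), assuming contrastive pairs are given as graph edges, is shown below; the function names are ours. Since the chosen edges form a matching, any vertex cover needs at least one endpoint per chosen edge, so half of the returned cover size is a valid lower bound on the minimum vertex cover.

```python
from typing import Hashable, Iterable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def approx_vertex_cover(edges: Iterable[Edge]) -> Set[Hashable]:
    """Greedy maximal matching: take both endpoints of every still-uncovered edge."""
    cover: Set[Hashable] = set()
    for a, b in edges:
        if a not in cover and b not in cover:
            cover.update((a, b))
    return cover


def mvc_lower_bound(edges: Iterable[Edge]) -> int:
    """Lower bound on the minimum vertex cover size (assumes no self-loops)."""
    return len(approx_vertex_cover(edges)) // 2


pairs = [(1, 2), (2, 3), (3, 4)]
cover = approx_vertex_cover(pairs)
```

For the path 1-2-3-4, the sketch returns a cover of size 4, giving a lower bound of 2, which matches the true minimum cover {2, 3}.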


Benchmarks. As benchmarks, we used DNNs trained over the MNIST dataset 
for handwritten digit recognition [45]. These networks classify 28 x 28 grayscale 
images into the digits 0,...,9. Additionally, we used DNNs trained over the 
Fashion-MNIST dataset [71], which classify 28 x 28 grayscale images into 10 
clothing categories (“Dress”, “Coat”, etc.). For each of these datasets we trained
a DNN with the following architecture: (i) an input layer (which corresponds 
to the image) of size 784; (ii) a fully connected hidden layer with 30 neurons; 
(iii) another fully connected hidden layer, with 10 neurons; and (iv) a final, 
softmax layer with 10 neurons, corresponding to the 10 possible output classes. 
The accuracy of the MNIST DNN was 96.6%, whereas that of the Fashion- 
MNIST DNN was 87.6%. (We note that we configured LIME to ignore the 
external border pixels of each input, as these are not part of the actual image.) 

In selecting the classification instances to be explained for these networks, 
we targeted input points where the network was not confident — i.e., where 
the winning label did not win by a large margin. The motivation for this choice 
is that explanations are most useful and relevant in cases where the network’s 
decision is unclear, which is reflected in lower confidence scores. Additionally, 
explanations of instances with lower confidence tend to be larger, facilitating 
the process of extensive experimentation. We thus selected the 100 inputs from 
the MNIST and the Fashion-MNIST datasets where the networks demonstrated 
the lowest confidence scores — i.e., where the difference between the winning 
output score and the runner-up class score was minimal. 
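
The selection criterion described above can be sketched as follows, assuming the network's output scores are available as one list per input; `confidence_margin` and `lowest_confidence_indices` are illustrative names, not part of the paper's framework.

```python
from typing import List, Sequence


def confidence_margin(scores: Sequence[float]) -> float:
    """Difference between the winning score and the runner-up score."""
    top, runner_up = sorted(scores, reverse=True)[:2]
    return top - runner_up


def lowest_confidence_indices(all_scores: Sequence[Sequence[float]], k: int) -> List[int]:
    """Indices of the k inputs with the smallest winner/runner-up margin."""
    order = sorted(range(len(all_scores)), key=lambda i: confidence_margin(all_scores[i]))
    return order[:k]


scores = [
    [0.9, 0.1, 0.0],    # confident prediction
    [0.4, 0.35, 0.25],  # very close call
    [0.6, 0.3, 0.1],    # moderately confident
]
selected = lowest_confidence_indices(scores, 2)
```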


Experiments. Our first goal was to compare our approach to that of Ignatiev et 
al. [31], which is the current state of the art in verification-based explainability of 
DNNs. Other approaches consider other ML types, such as decision trees [33,36], 
or focus on alternative definitions for abductive explanations [42,70] and are 
thus not comparable. Because the implementation used in [31] is unavailable, we 
implemented their approach, using Marabou as the underlying verifier for a fair 
comparison. In addition, we used the same heuristic model, LIME, for sorting
the input features’ relevance. Fig. 7 depicts a comparison of the two approaches,
over the MNIST benchmarks. The Fashion-MNIST results were similar, but since
the Fashion-MNIST network had lower accuracy it tended to produce larger
explanations with lower run-times, resulting in less meaningful evaluations (due
to space limitations, these results appear in Fig. 12 in the appendix of the full
version of this paper [11]). We compared the approaches according to two criteria:
the portion of input features whose participation in the explanation was verified,
over time (part (a) of Fig. 7), and the average size of the presently obtained
explanation over time, also presented as a fraction of the total number of input
features (part (b)). The results indicate that our method significantly improves
over the state of the art, verifying the participation of 40.4% additional features,
on average, and producing explanations that are 9.7% smaller, on average, at
the end of the 1-hour time limit. Furthermore, our method timed out on 10%
fewer benchmarks. We regard this as compelling evidence of the potential of our
approach to produce more efficient verification-based XAI.

Fig. 7: Our full and ablation-based results, compared to the state of the art for
finding minimal explanations on the MNIST dataset.

We also looked into comparing our approach to heuristic, non-verification- 
based approaches, such as LIME itself; but these comparisons did not prove 
to be meaningful, as the heuristic approaches typically solved benchmarks very 
quickly, but very often produced incorrect explanations. This matches the find- 
ings reported in previous work [14,32]. 

Next, we set out to evaluate the contribution of each of the components 
implemented within our framework to overall performance, using an ablation 
study. Specifically, we ran our framework with each of the components mentioned
in Sec. 4, i.e. (i) information exchange between $T_{UB}$ and $T_{LB}$; (ii) the
binary search in $T_{UB}$; and (iii) local-singleton search, turned off. The results on
the MNIST benchmarks appear in Fig. 7; see Fig. 12 in the appendix of the 
full version of this paper [11] for the Fashion-MNIST results. Our experiments 
revealed that each of the methods mentioned in Sec. 4 had a favorable impact 
on both the average portion of features verified, and the average size of the discovered
explanation, over time. Fig. 7a indicates that the local-singleton search
method, used for efficiently proving that features are bound to be included in
the explanation, was the most significant in reducing the number of features
remaining to be verified, thus substantially increasing the portion of verified
features. Moreover, Fig. 7b indicates that the binary search method, which is used
for grouping UNSAT queries and proving the exclusion of features from the
explanation, was the most significant for more efficiently obtaining smaller-sized
explanations, over time.


Our second goal was to evaluate the quality of the minimum explanation
approximation of our method (using the lower/upper bounds) over time. Results are
averaged over all benchmarks of the MNIST dataset and are presented in Fig. 8
(similar results on Fashion-MNIST appear in Fig. 13 in the appendix of the full
version of this paper [11]). The upper bound represents the average size of the
explanation discovered by $T_{UB}$ over time, whereas the lower bound represents the
average lower bound discovered by $T_{LB}$ over time. It can be seen that initially,
there is a steep increase in the size of the lower bound, as $T_{LB}$ discovered many
contrastive singletons. Later, as we begin iterating over contrastive pairs, the
verification queries take longer to solve, and progress becomes slower. The average
approximation ratio achieved after an hour was 1.61 for MNIST and 1.19 for
Fashion-MNIST.

Fig. 8: Average approximation of minimum explanation over time.


For our third experiment, we set out to assess the improvements afforded by 
bundles. We repeated the aforementioned experiments, this time using sets of 
features representing bundles instead of the features themselves. The segmenta- 
tion into bundles was performed using the quickshift method [65], with LIME 
again used for assigning relevance to each bundle [55]. We approximate the sizes 
of the bundle explanations in terms of both the minimum bundle explanation as 
well as the minimum (non-bundle) explanation (as mentioned in Sec. 5 and in 
Sec. E of the appendix of the full version of this paper [11]). The bundle con- 
figuration showed drastic efficiency improvements, with none of the experiments 
timing out within the 1-hour time limit, thus improving the portion of timeouts 
on the MNIST dataset by 84%. The efficiency improvement was obtained at the 
expense of explanation size, resulting in a decrease of 352% in the approxima- 
tion ratios obtained for MNIST and 39% for Fashion-MNIST. Nevertheless, when 
calculating the approximation in terms of the minimum bundle explanation, an 
increase of 12% and 8% was obtained for MNIST and Fashion-MNIST (results 
are summarized in Table 1 in the appendix of the full version of this paper [11]).
For a visual evaluation, we performed the same set of experiments for both bundle
and non-bundle implementations, using instances with high confidence rates
to obtain smaller-sized explanations that could be more easily interpreted. A
sample of these results is presented in Fig. 9. Empirically, we observe that the
bundle-produced explanations are less complex and more comprehensible.

(a) Original image (b) Explanation (c) Bundle explanation

Fig. 9: Minimal explanations and bundle explanations found by our method on
the Fashion-MNIST dataset. White pixels are not part of the explanation.

Overall, we regard our results as compelling evidence that verification-based 
XAI can soundly produce meaningful explanations, and that our improvements 
can indeed significantly improve its runtime. 


7 Related Work 


Our work is another step in the ongoing quest for formal explainability of DNNs, 
using verification [19, 27, 31,58]. Related approaches have applied enumeration 
of contrastive examples [30,31], which is also an ingredient of our approach. 
Other approaches focus on producing abductive explanations around an epsilon 
environment [42,70]. Similar work has been carried out for decision sets [33], 
lists [28] and trees [36], where the problem appears to be simpler to solve [36]. 
Our work here tackles DNNs, which are known to be more difficult to verify [39]. 

Prior work has also sought to produce approximate explanations, e.g., by using
$\delta$-relevant sets [37,66]. This line of work has focused on probabilistic methods
for generating explanations, which jeopardizes soundness. There has also been 
extensive work in heuristic XAI [50, 55,56,61], but here, too, the produced ex- 
planations are not guaranteed to be correct. 


8 Conclusion 


Although DNNs are becoming crucial components of safety-critical systems, they 
remain “black-boxes”, and cannot be interpreted by humans. Our work seeks to 
mitigate this concern, by providing formally correct explanations for the choices 
that a DNN makes. Since discovering the minimum explanations is difficult, we 
focus on approximate explanations, and suggest multiple techniques for expedit- 
ing our approach — thus significantly improving over the current state of the art. 
In addition, we propose to use bundles to efficiently produce more meaningful 
explanations. Moving forward, we plan to leverage lightweight DNN verification 
techniques for improving the scalability of our approach [49], as well as extend
it to support additional DNN architectures. 

Abstract. Occlusion is a prevalent and easily realizable semantic perturbation
to deep neural networks (DNNs). It can fool a DNN into misclassifying an input 
image by occluding some segments, possibly resulting in severe errors. Therefore, 
DNNs planted in safety-critical systems should be verified to be robust against 
occlusions prior to deployment. However, most existing robustness verification 
approaches for DNNs are focused on non-semantic perturbations and are not suited 
to the occlusion case. In this paper, we propose the first efficient, SMT-based 
approach for formally verifying the occlusion robustness of DNNs. We formulate 
the occlusion robustness verification problem and prove it is NP-complete. Then, 
we devise a novel approach for encoding occlusions as a part of neural networks 
and introduce two acceleration techniques so that the extended neural networks can 
be efficiently verified using off-the-shelf, SMT-based neural network verification 
tools. We implement our approach in a prototype called OccRos and extensively 
evaluate its performance on benchmark datasets with various occlusion variants. 
The experimental results demonstrate our approach’s effectiveness and efficiency in 
verifying DNNs’ robustness against various occlusions, and its ability to generate 
counterexamples when these DNNs are not robust. 


1 Introduction 


Deep neural networks (DNNs) are computer-trained programs that can implement 
hard-to-formally-specify tasks. They have repeatedly demonstrated their potential in 
enabling artificial intelligence in various domains, such as face recognition [6] and 
autonomous driving [27]. They are increasingly being incorporated into safety-critical 
applications with interactive environments. To ensure the security and reliability of these 
applications, DNNs must be highly dependable against adversarial and environmental 
perturbations. This dependability property is known as robustness and is attracting 
a considerable amount of research efforts from both academia and industry, aimed at 
ensuring robustness via different technologies such as adversarial training [13,28], testing 
[40,33], and formal verification [34,10,5]. 

Occlusion is a prevalent kind of perturbation, which may cause DNNs to misclassify 
an image by occluding some segment thereof [38,25,8]. For instance, a “turn left” traffic 
sign may be misclassified as “go straight” after it is occluded by a tape, probably resulting 
in traffic accidents. A similar situation may occur in face recognition, where many well- 
trained neural networks fail to recognize faces correctly when they are partially occluded, 
such as when glasses are worn [37]. A neural network is called robust against occlusions
if small occlusions do not alter its classification results. Generally, we wish a DNN to be
robust against occlusions that appear negligible to humans. 

It is challenging to verify whether a DNN is robust or not on an input image if the 
image is occluded. On the one hand, the verification problem is non-convex due to the 
non-linear activation functions in DNNs. It is NP-complete even when dealing with 
common, fully connected feed-forward neural networks (FNNs) [20]. On the other hand, 
unlike existing perturbations, occlusions are challenging to encode using $L_p$ norms. Most
existing robustness verification approaches assume that perturbations need to be defined
by $L_p$ norms and then apply approximations and abstract interpretation techniques
[34,10,5] as part of the verification process. The semantic effect of occlusions partially
alters the values of some neighboring pixels from large to small or in the inverse direction,
e.g., 255 to 0, when a black occlusion occludes a white pixel. Therefore, existing
techniques for perturbations in $L_p$ norms are not suited to occlusion perturbations.

SMT-based approaches have been shown to be an efficient approach to DNN ver- 
ification [20]. They are both sound and complete, in that they always return definite 
results and produce counterexamples in non-robust cases. We show that, although it is 
straightforward to encode the occlusion robustness verification problem into SMT for- 
mulas, solving the constraints generated by this naive encoding is experimentally beyond 
the reach of state-of-the-art SMT solvers, due to the inclusion of a large number of the 
piece-wise ReLU activation functions. Consequently, such a straightforward encoding 
approach cannot scale to large networks. 

In this paper, we systematically study the occlusion robustness verification problem 
of DNNs. We first formalize and prove that the problem is NP-complete for ReLU- 
based FNNs. Then, we propose a novel approach for encoding various occlusions and 
neural networks together to generate new equivalent networks that can be efficiently 
verified using off-the-shelf SMT-based robustness verification tools such as Marabou 
[21]. In our encoding approach, although additional neurons and layers are introduced 
for encoding occlusions, the number is reasonably small and independent of the networks 
to be verified. The efficiency improvement of our approach comes from the fact that our 
approach significantly reduces the number of constraints introduced while encoding the 
occlusion and leverages the backend verification tool’s optimization against the neural 
network structure. Furthermore, we introduce two acceleration techniques, namely input- 
space splitting to reduce the search space of a single verification, which can significantly 
improve verification efficiency, and label sorting to help verification terminate earlier.
We implement a tool called OccRos with Marabou as the backend verification tool. To 
our knowledge, this is the first work on formally verifying the occlusion robustness of 
deep neural networks. 

To demonstrate the effectiveness and efficiency of OccRos, we evaluate it on six 
representative FNNs trained on two benchmark datasets. The empirical results show 
that our approach is effective and efficient in verifying various types of occlusions with 
respect to the occlusion position, size, and occluding pixel value. 

Contributions. We make the following three major contributions: (i) we propose a novel 
approach for encoding occlusion perturbations, by which we can leverage off-the-shelf 
SMT-based robustness verification tools to verify the robustness of neural networks 
against various occlusion perturbations; (ii) we prove the verification problem of the
occlusion robustness is NP-complete and introduce two acceleration techniques, i.e., 
label sorting and input space splitting, to improve the efficiency of verification further; 
and (iii) we implement a tool called OccRos and conduct experiments extensively on a 
collection of benchmarks to demonstrate its effectiveness and efficiency. 
Paper Organization. Sec. 2 introduces preliminaries. Sec. 3 formulates the occlusion 
robustness verification problem and studies its complexity. Sec. 4 presents our encoding 
approach and acceleration techniques for the verification. Sec. 5 shows the experimental 
results. Sec. 6 discusses related work, and Sec. 7 concludes the paper. 

We omit the complete proofs and experimental results due to the page limit. Please 
refer to the technical report [15] for more details. 


2 Preliminaries 


2.1 Deep Neural Networks and the Robustness 


As shown in Fig. 1, a deep neural network consists of multiple layers. The neurons on
the input layer take input values, which are computed and propagated through the
hidden layers and then output by the output layer. The neurons on each layer are
connected to those on the predecessor and successor layers. We only consider fully
connected, feedforward networks (FNNs) [11].

Fig. 1: A fully-connected feed-forward neural network (FNN).

Given a $\lambda$-layer neural network, let $W^{(i)}$ be the weight matrix between the $(i-1)$-th
and $i$-th layers, and $b^{(i)}$ be the biases of the corresponding neurons, where $i = 1, 2, \ldots, \lambda$.
The network implements a function $F : \mathbb{R}^u \to \mathbb{R}^r$ that is recursively defined by:

$$z^{(0)} = x$$
$$z^{(i)} = \sigma(W^{(i)} \cdot z^{(i-1)} + b^{(i)}), \quad \text{for } i = 1, \ldots, \lambda - 1 \qquad \text{(Layer Function)}$$
$$F(x) = W^{(\lambda)} \cdot z^{(\lambda-1)} + b^{(\lambda)} \qquad \text{(Network Function)}$$

where $\sigma(\cdot)$ is called an activation function and $z^{(i)}$ denotes the result of neurons at the
$i$-th layer.

For example, Fig. 1 shows a 3-layer neural network with three input neurons and two
output neurons, namely, $\lambda = 3$, $u = 3$ and $r = 2$.

For the sake of simplicity, we use $\Phi_F(x) = \arg\max_{\ell \in L} F(x)$ to denote the label $\ell$
such that the probability $F_\ell(x)$ of classifying $x$ to $\ell$ is larger than those to other labels,
where $L$ represents the set of labels. The activation function $\sigma$ usually can be a piece-wise
Rectified Linear Unit (ReLU), $\sigma(x) = \max(x, 0)$, or S-shape functions like Sigmoid
$\sigma(x) = \frac{1}{1+e^{-x}}$, Tanh $\sigma(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$, or Arctan $\sigma(x) = \tan^{-1}(x)$. In this work, we focus on
the networks that only contain ReLU activation functions, which are widely adopted in
real-world applications.
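
The recursive definition of the network function can be illustrated with a small pure-Python sketch for ReLU networks. The representation (one weight matrix and bias vector per layer, as nested lists) and the function names are our own assumptions for illustration.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def relu(z: Sequence[float]) -> Vector:
    """Element-wise ReLU activation."""
    return [max(v, 0.0) for v in z]


def affine(W: Matrix, z: Sequence[float], b: Sequence[float]) -> Vector:
    """Compute W . z + b for one layer."""
    return [sum(w * x for w, x in zip(row, z)) + bias for row, bias in zip(W, b)]


def forward(weights: List[Matrix], biases: List[Vector], x: Sequence[float]) -> Vector:
    """Hidden layers apply ReLU; the last layer is affine only."""
    z = list(x)
    for W, b in zip(weights[:-1], biases[:-1]):
        z = relu(affine(W, z, b))
    return affine(weights[-1], z, biases[-1])


def predict_label(weights: List[Matrix], biases: List[Vector], x: Sequence[float]) -> int:
    """Index of the largest output score (arg max over labels)."""
    scores = forward(weights, biases, x)
    return max(range(len(scores)), key=scores.__getitem__)


weights = [[[1.0, -1.0], [-1.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
biases = [[0.0, 0.0], [0.0, 0.0]]
output = forward(weights, biases, [2.0, 1.0])
```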

(a) Multiform: 30km/h (b) Origin 70km/h (c) Uniform: 30km/h (d) Origin 70km/h

Fig. 2: Two multiform and uniform occlusions to traffic signs causing mis-classifications.


A neural network is called robust if small perturbations to its inputs do not alter the
classification result [39]. Specifically, given a network $F$, an input $x_0$ and a set $\Omega$ of
perturbed inputs of $x_0$, $F$ is called locally robust with respect to $x_0$ and $\Omega$ if $F$ classifies
all the perturbed inputs in $\Omega$ to the same label as it does $x_0$.


Definition 1 (Local Robustness [17]). A neural network $F : \mathbb{R}^u \to \mathbb{R}^r$ is called locally
robust with respect to an input $x_0$ and a set $\Omega$ of perturbed inputs of $x_0$ if
$\forall x \in \Omega,\ \Phi_F(x) = \Phi_F(x_0)$ holds.


Usually, the set $\Omega$ of perturbed inputs is defined by an $\ell_p$-norm ball around $x_0$ with a
radius of $\epsilon$, i.e., $\mathbb{B}_p(x_0, \epsilon) := \{x \mid \|x - x_0\|_p \le \epsilon\}$ [17,2].
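
As a small illustration of the norm-ball definition, the sketch below tests whether a perturbed input lies in the ball around $x_0$. It assumes the closed ball (distance at most the radius), and the helper names are ours.

```python
import math
from typing import Sequence


def lp_distance(x: Sequence[float], x0: Sequence[float], p: float) -> float:
    """p-norm of x - x0; p may be math.inf for the maximum norm."""
    diffs = [abs(a - b) for a, b in zip(x, x0)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def in_lp_ball(x: Sequence[float], x0: Sequence[float], eps: float, p: float) -> bool:
    """True iff x lies in the closed p-norm ball of radius eps around x0."""
    return lp_distance(x, x0, p) <= eps


x0 = [0.0, 0.0]
x = [3.0, 4.0]
```

The same point can be inside the ball for one norm and outside for another: here the distance is 4 in the maximum norm, 5 in the Euclidean norm, and 7 in the 1-norm.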


2.2 Occlusion Perturbation 


In the context of image classification networks, occlusion is a kind of perturbation that 
blocks the pixels in certain areas before the image is fed into the network. Existing 
studies showed that the classification accuracy of neural networks could be significantly 
decreased when the input objects are artificially occluded [23,44]. 

Occlusions can have various occlusion shapes, sizes, colors, and positions. The 
shapes can be square, rectangle, triangle, or irregular shape. The size is measured by the 
number of occluded pixels. The occlusion color specifies the colors occluded pixels can 
take. The coloring of an occlusion can be either uniform, where all occluded pixels share 
the same color, or multiform, where these colors can vary in the range of $[-\epsilon, \epsilon]$, where
$\epsilon$ specifies the threshold between an occluded pixel’s value and its original value.
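
The two colorings can be sketched for rectangular occlusions placed at integer pixel positions. This is an illustrative reading of the description above, not the paper's encoding: images are nested lists of numbers, and the multiform case is modelled as per-pixel offsets clipped to the threshold, which is our assumption.

```python
from typing import List, Sequence

Image = List[List[float]]


def apply_uniform_occlusion(image: Image, top: int, left: int,
                            height: int, width: int, color: float) -> Image:
    """Set every pixel inside the height x width rectangle to the same color."""
    out = [row[:] for row in image]
    for i in range(top, min(top + height, len(out))):
        for j in range(left, min(left + width, len(out[i]))):
            out[i][j] = color
    return out


def apply_multiform_occlusion(image: Image, top: int, left: int,
                              deltas: Sequence[Sequence[float]], eps: float) -> Image:
    """Shift each occluded pixel by its own offset, clipped to [-eps, eps]."""
    out = [row[:] for row in image]
    for di, delta_row in enumerate(deltas):
        for dj, delta in enumerate(delta_row):
            i, j = top + di, left + dj
            if 0 <= i < len(out) and 0 <= j < len(out[i]):
                out[i][j] = out[i][j] + max(-eps, min(eps, delta))
    return out


white = [[255, 255, 255], [255, 255, 255], [255, 255, 255]]
uniform = apply_uniform_occlusion(white, 0, 0, 2, 2, 0)
multiform = apply_multiform_occlusion([[100, 100], [100, 100]], 0, 0, [[50, -5]], 10)
```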

Prior studies [8,3] showed that both the uniform and multiform occlusions could
cause misclassification to neural networks. Fig. 2 shows two examples of multiform
and uniform occlusions, respectively. The traffic sign for “70km/h speed limit” in
Fig. 2(a) is misclassified to “30km/h” by adding a 5 × 5 multiform occlusion. Fig. 2(d)
shows another sign, with different light conditions, where a 3 × 3 uniform occlusion
(in Fig. 2(c)) causes the sign to be misclassified to “30km/h”.

The occlusion position is another aspect of defining occlusions. An occlusion can be
placed precisely on the pixels of an image, or between a pixel and its neighbors.
Fig. 3 shows an example, where the dots represent image pixels and the circles are
the occluding pixels that will substitute the occluded ones.

Fig. 3: An example occlusion on a 5 × 5 image at real number position.

We say that an occlusion pixel 3), at location (i, j’) surrounds an image pixel p; j at 
location (i, j) if and only if |i — i’| < 1 and |j — j’| < 1. Note that 7’, 7’ are real numbers, 
representing the location where the occlusion pixel o is placed on the image. An image 
pixel can be occluded by the substitute occlusion pixels if the occlusion pixels surround 
the image pixel. 

There are at most four surrounding occlusion pixels for each image pixel, as shown
in Fig. 3. Let $I_p$ be the set of the locations where the surrounding occlusion pixels of $p$
are placed. After the occlusion, the value of pixel $p_{i,j}$ is altered to a new one denoted by
$p'_{i,j}$, which can be computed by interpolation [19,22], such as nearest-neighbour
interpolation or bilinear interpolation, based on the occlusion pixels in $I_p$. Besides that,
we use a method based on the $L_1$-distance to calculate how much a pixel is occluded. Since the
$L_1$-distance of two adjacent pixels is 1, a surrounding occlusion pixel should not affect the
image pixel if their $L_1$-distance is greater than 1. The formula
$\max(0, (1 - |i' - i|) + (1 - |j' - j|) - 1)$ indicates how much an image pixel at $(i, j)$ is
occluded by an occlusion pixel at $(i', j')$. For instance, the occlusion pixel at
$(i', j') = (0.9, 0.9)$ has no effect on the image pixel $(i, j) = (0, 0)$ since their
$L_1$-distance is larger than 1. Therefore, the occlusion factor $s_{i,j}$ for pixel $p$ at
$(i, j)$ can be calculated based on all surrounding occlusion pixels in $I_p$ as:


$$s_{i,j} = \max\Big(0,\ \sum_{(i',j') \in I_p,\ i' = i'_0} (1 - |j' - j|) + \sum_{(i',j') \in I_p,\ j' = j'_0} (1 - |i' - i|) - 1\Big) \qquad (1)$$


where $(i'_0, j'_0)$ is the first element of $I_p$. Notably, $s$ is 1 for a completely occluded
pixel and 0 for a pixel that is not occluded; otherwise $s$ has a value in $(0, 1)$. Also,
Equation 1 has a special case when $(i', j')$ are integers, where $s$ reduces to 0 or 1.
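
The occlusion factor can be illustrated with a short Python sketch. It is only one reading of Equation 1: it assumes (the text does not state this explicitly) that the occlusion pixels of a $w \times h$ occlusion anchored at $(a, b)$ sit at $(a + k, b + l)$ for integer offsets $0 \le k < w$ and $0 \le l < h$. The function names are hypothetical.

```python
def surrounding_pixels(i, j, a, b, w, h):
    """Locations of the occlusion pixels that surround image pixel (i, j).

    Assumption: occlusion pixels sit at (a + k, b + l), 0 <= k < w, 0 <= l < h.
    """
    return [(a + k, b + l)
            for k in range(w) for l in range(h)
            if abs(i - (a + k)) < 1 and abs(j - (b + l)) < 1]


def occlusion_factor(i, j, a, b, w, h):
    """Occlusion factor s_{i,j} in [0, 1], following Equation 1."""
    pixels = surrounding_pixels(i, j, a, b, w, h)
    if not pixels:
        return 0.0  # no surrounding occlusion pixel: the pixel is not occluded
    i0, j0 = pixels[0]  # first element of I_p
    # Contributions of the occlusion pixels sharing the first row / column coordinate.
    along_j = sum(1 - abs(jp - j) for ip, jp in pixels if ip == i0)
    along_i = sum(1 - abs(ip - i) for ip, jp in pixels if jp == j0)
    return max(0.0, along_i + along_j - 1)


print(occlusion_factor(1, 2, 1.5, 2, 1, 1))    # pixel half-covered along one axis
print(occlusion_factor(0, 0, 0.9, 0.9, 1, 1))  # L1-distance 1.8 > 1, so no effect
```

For integer positions the factor is either 0 or 1, which matches the special case noted above.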


3 The Occlusion Robustness Verification Problem 


Let $\mathbb{R}^{m \times n}$ be the set of images whose height is $m$ and width is $n$. We use
$\mathbb{N}_{1,m}$ (resp. $\mathbb{N}_{1,n}$) to denote the set of all the natural numbers
ranging from 1 to $m$ (resp. $n$). A coloring function
$\zeta : \mathbb{R}^{m \times n} \times \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a mapping
of each pixel of an image to its corresponding color value. Given an image
$x \in \mathbb{R}^{m \times n}$, $\zeta(x, i, j)$ defines the value to color the pixel of $x$
at $(i, j)$.


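Definition 2 (Occlusion function). Given a coloring function $\zeta$ and an occlusion
$\vartheta$ of size $w \times h$, the occlusion function is defined as a function
$\gamma_{\zeta,w\times h} : \mathbb{R}^{m \times n} \times \mathbb{R} \times \mathbb{R} \to \mathbb{R}^{m \times n}$
such that $x' = \gamma_{\zeta,w\times h}(x, a, b)$ if for all $i \in \mathbb{N}_{1,n}$ and
$j \in \mathbb{N}_{1,m}$,


$$x'_{i,j} = x_{i,j} - s_{i,j} \times (x_{i,j} - \zeta(x, i, j)) \qquad (2)$$

$$\text{where } \zeta(x, i, j) = \frac{\sum_{(i',j') \in I_p} \vartheta_{i',j'} \sqrt{(i - i')^2 + (j - j')^2}}{\sum_{(i',j') \in I_p} \sqrt{(i - i')^2 + (j - j')^2}} \qquad (3)$$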


$s_{i,j}$ in Equation 2 is the occlusion factor for the pixel at $(i, j)$, as introduced in
Sec. 2.2. Note that when $i', j'$ are integers, Equation 2 reduces to
$x'_{i,j} = \vartheta_{i,j}$, which represents that $x_{i,j}$ is completely occluded by the
occlusion. In other words, the integer case is a special case of the real number case. Also,
when the pixel at $(i, j)$ is not occluded, $s_{i,j} = 0$, and Equation 2 reduces to
$x'_{i,j} = x_{i,j}$.

Interpolation is handled by $\zeta$, as shown in Equation 3, which gives the standard form
for the color of the new $x'_{i,j}$. For a uniform occlusion, a unique color value is specified
for all the pixels in the occluded area. Therefore, $\zeta$ in Equation 3 reduces to
$\zeta(x, i, j) = \mu$ for some $\mu \in [0, 1]$. The coloring function in a multiform occlusion
is defined as $\zeta(x, i, j) = x_{i,j} + \Delta_p$ with $\Delta_p \in [-\epsilon, \epsilon]$,
where $\epsilon \in \mathbb{R}$ defines the threshold by which a pixel can be altered.
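
As an illustration of the two coloring functions, the following Python sketch applies Equation 2 in the integer-position special case, where the occlusion factor is 0 or 1. The function names, the 1-based indexing convention (`a` along rows, `b` along columns) and the dictionary of per-pixel offsets are assumptions made for this sketch only.

```python
def uniform(mu):
    """Uniform coloring: every occluded pixel receives the same color mu."""
    return lambda x, i, j: mu


def multiform(delta, eps):
    """Multiform coloring: pixel (i, j) is shifted by delta[(i, j)] in [-eps, eps]."""
    if any(abs(d) > eps for d in delta.values()):
        raise ValueError("every offset must lie in [-eps, eps]")
    return lambda x, i, j: x[i - 1][j - 1] + delta.get((i, j), 0.0)


def occlude_integer(x, a, b, w, h, color):
    """Equation 2 for an occlusion whose top-left corner is the integer pixel (a, b).

    x is a list of rows; pixel indices are 1-based.
    """
    out = [row[:] for row in x]  # unoccluded pixels keep their value (s = 0)
    for i in range(a, min(a + w, len(x) + 1)):
        for j in range(b, min(b + h, len(x[0]) + 1)):
            out[i - 1][j - 1] = color(x, i, j)  # fully occluded pixel (s = 1)
    return out


image = [[0.4, 0.6], [0.55, 0.72]]
print(occlude_integer(image, 1, 2, 1, 1, uniform(0.0)))
print(occlude_integer(image, 1, 2, 1, 1, multiform({(1, 2): 0.1}, eps=0.1)))
```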


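Definition 3 (Local occlusion robustness). Given a DNN $F$ over input images in
$\mathbb{R}^{m \times n}$, an occlusion function
$\gamma_{\zeta,w\times h} : \mathbb{R}^{m \times n} \times \mathbb{R} \times \mathbb{R} \to \mathbb{R}^{m \times n}$
with respect to coloring function $\zeta$ and occlusion size $w \times h$, and an input image
$x$, $F$ is called local occlusion robust on $x$ with $\gamma_{\zeta,w\times h}$ if
$\Phi_F(x) = \Phi_F(\gamma_{\zeta,w\times h}(x, a, b))$ holds for all $1 \le a \le n$ and
$1 \le b \le m$.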


Intuitively, Definition 3 means that $F$ is robust on $x$ against the occlusions of
$\gamma_{\zeta,w\times h}$ if, on any occluded image of $x$ produced by the occlusion function
$\gamma_{\zeta,w\times h}$, $F$ always returns the same classification result as on the
original image $x$. Depending on the coloring function $\zeta$, the definition applies to
various occlusions concerning shapes, colors, sizes, and positions. We can also extend the
above definition to global occlusion robustness if $F$ is robust on all images concerning
$\gamma_{\zeta,w\times h}$.

We prove that even for the case of uniform occlusion, a special case of the multiform 
one, the local occlusion robustness verification problem is NP-complete on the ReLU- 
based neural networks. 


4 SMT-Based Occlusion Robustness Verification 
4.1 A Naive SMT Encoding Method 


The verification problem of FNNs’ local occlusion robustness can be straightforwardly
encoded into an SMT problem. In Definition 3, we assume that $x$ is classified by $\Phi_F$ to
the label $\ell_q$, i.e., $\Phi_F(x) = \ell_q$, for a label $\ell_q \in L$. To prove that $F$ is
robust on $x$ after $x$ is occluded by an occlusion $\vartheta$ of size $w \times h$, it
suffices to prove that $F$ classifies every occluded image
$x' = \gamma_{\zeta,w\times h}(x, a, b)$ to $\ell_q$ for all $1 \le a \le n$ and
$1 \le b \le m$. This is equivalent to proving that the following constraints are not
satisfiable:


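$$1 \le a \le n,\quad 1 \le b \le m, \qquad (4)$$

$$\bigwedge_{i \in \mathbb{N}_{1,n},\ j \in \mathbb{N}_{1,m}} \Big( \big( (a - 1 < i < a + w + 1 \wedge b - 1 < j < b + h + 1) \wedge x'_{i,j} = \gamma_{\zeta,w\times h}(x, a, b)_{i,j} \big) \vee \big( (i \ge a + w + 1 \vee i \le a - 1 \vee j \ge b + h + 1 \vee j \le b - 1) \wedge x'_{i,j} = x_{i,j} \big) \Big), \qquad (5)$$

$$\bigvee_{\ell_i \in L \setminus \{\ell_q\}} F(x')_{\ell_i} > F(x')_{\ell_q} \qquad (6)$$
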
The conjuncts in Eq. 5 define that $x'$ is an occluded instance of $x$, and the disjuncts
in Eq. 6 indicate that, when satisfiable, there exists some label $\ell_i$ to which $x'$ is
classified with a higher probability than to $\ell_q$. Namely, the occlusion robustness of $F$
on $x$ is falsified, with $x'$ being a witness of the non-robustness. Note that this naive
encoding considers the occlusion position’s real number cases since function $\gamma$
implicitly includes the interpolation.

Fig. 4: The workflow of encoding and verifying FNN’s robustness against occlusions.

Although the above encoding is straightforward, solving the encoded constraints is
experimentally beyond the reach of existing general-purpose SMT solvers due to the
piece-wise linear ReLU activation functions in the definition of $F$ in the constraints of
Eq. 6, and the large search space $m \times n \times (2\epsilon)^{w \times h}$ (see Experiment
II in Sec. 5).


4.2 Our Encoding Approach 


An Overview of the Approach. To improve efficiency, we propose a novel approach 
for encoding occlusion perturbations into four layers of neurons and concatenating the 
original network to these so-called occlusion layers, constituting a new neural network 
which can be efficiently verified using state-of-the-art, SMT-based verifiers. 

Fig. 4 shows the overview of our approach. Given an input image and an occlusion, 
we first construct a 3-hidden-layer occlusion neural network (ONN) and then concatenate 
it to the original FNN by connecting the ONN’s output layer to the FNN’s input layer. 
The combined network represents all possible occluded inputs and their classification 
results. The robustness of the constructed network can be verified using the existing 
SMT-based neural network verifiers. 

We introduce two acceleration techniques to speed up the verification further. First, 
we divide the occlusion space into several smaller, orthogonal spaces, and verify a finite 
set of sub-problems on the smaller spaces. Second, we employ the eager falsification 
technique [14] to sort the labels according to their probabilities of being misclassified to. 
The one with a larger probability is verified earlier by the backend tools. Whenever a 
counterexample is returned, an occluded image is found such that its classification result 
differs from the original one. If all sub-problems are verified and no counterexamples are 
found, the network is verified robust on the input image against the provided occlusion. 


Encoding Occlusions as Neural Networks. Given a coloring function $\zeta$, an occlusion
size $w \times h$ and an input image $x$ of size $m \times n$, we construct a neural network
$O : \mathbb{R}^{4+t} \to \mathbb{R}^{m \times n}$ to encode all the possible occluded images of
$x$, where $c = 1$ if $x$ is a grey image and $c = 3$ if $x$ is an RGB image, $t = 0$ for the
uniform occlusion and $t = w \times h$ for the multiform one.

Fig. 5 shows the neural network architecture for encoding occlusions. We divide it
into a fundamental part and an additional part. The former encodes the occlusion position
and the uniform occlusion color. The additional part is needed only by the multiform
occlusion to encode the coloring function. Without loss of generality, we assume that
the input layer takes the vector $(a, w, b, h, \zeta)$, where $(a, b)$ is the top-left
coordinate of the occlusion area in $x$. The coloring function $\zeta$ is admitted by another
$c \times t$ neurons in the input layer when the occlusion is multiform.


(1) Encoding occlusion positions. We explain the weights and biases that are defined in
the neural network to encode the occlusion position. On the connections between the input
layer and the first hidden layer, the weights in matrices $W_{1,1}$, $W_{1,2}$ and $W_{1,3}$
are 1, -1 and -1, respectively. Note that we hide all the edges whose weights are 0 in the
figure for clarity. The biases in $b_{1,1}$ are $(-1, -2, \ldots, -m)$ for the first $m$
neurons on the first hidden layer. Those in $b_{1,2}$ are $(2, 3, \ldots, m + 1)$. The weights
in $W_{1,4}$, $W_{1,5}$, $W_{1,6}$ and the biases in $b_{1,3}$ and $b_{1,4}$ are defined in the
same way. We omit the details due to the page limitation.

For the second layer, the diagonals of weight matrices $W_{2,1}$ to $W_{2,4}$ are set to -1,
and the rest of their entries are 0. The biases in $b_{2,1}$ and $b_{2,2}$ are 1. After the
propagation to the second hidden layer, a pixel at position $(i, j)$ in the image $x$ is
occluded if and only if both the outputs of the $i$-th neuron in the first $m$ neurons and the
$j$-th neuron in the remaining $n$ neurons on the second hidden layer are 1.

Fig. 5: An occlusion neural network for the occlusions on an image $x$ with $\zeta$ and
$w \times h$. (Layers shown: input, 1st hidden, 2nd hidden, 3rd hidden, output.)

The third hidden layer represents the occlusion status of each pixel in the original
image $x$. $2n$ weight matrices connect the second layer and the $n \times m$ neurons of the
third layer. For example, we consider the weights in $W_{3,i}$ and $W_{3,n+i}$, which connect
the $i$-th group of $m$ neurons in the third layer to the second layer. The size of $W_{3,i}$
is $m \times m$, and the weights in the $i$-th row are 1 while the rest are 0. The size of
$W_{3,n+i}$ is $m \times n$. The weights on its diagonal are set to 1, while the rest are set to
0. All the biases in $b_{3,1}$ to $b_{3,n}$ are -1. The output of the third layer indicates the
occlusion status of all the pixels. If a pixel at $(i, j)$ is occluded, then the output of the
$(i \times m + j)$-th neuron in the third layer is 1, and otherwise, 0.

(2) Encoding Coloring Functions. We consider the uniform and multiform coloring
functions separately for verification efficiency, although the former is a special case of the
latter. We first consider the general multiform case. In the multiform case, we introduce
$2 \times n \times m$ extra neurons in the third hidden layer, as shown in the bottom part of
Fig. 5. These neurons can be combined with the third layer, but it is clearer to separate
them. The weight matrix $W_{3,\zeta}$ connects the third layer to these neurons, with the
first half of its diagonal set to 1, and the second half set to -1. This helps retain the sign
of the input $\zeta$ during propagation. The weight matrix $W_{\zeta}$ connects the input
$\zeta$ to these neurons; its diagonal entries are 1, and the biases $b_{\zeta}$ are -1. These
neurons work just like the third layer, except that they not only represent the occlusion
status of pixels, but also preserve the input $\zeta$. If a pixel at $(i, j)$ is occluded and
$\zeta$ has a positive value, then the $(i \times m + j)$-th output in the first half of them is
$\zeta$. The $(i \times m + j)$-th output in the second half carries the magnitude of $\zeta$
when $\zeta$ has a negative value. Otherwise, the output is 0. The uniform case can be encoded
together with input images, and we thus explain it in the following paragraph.


(3) Encoding Input Images. In the fourth layer, we use $W_4$ to denote the weight matrix
connecting the third layer. $W_4$ is used to encode pixel values of the input image $x$ and the
coloring function $\zeta$ of occlusions. In the uniform case, the weight $w(i, i)$ on the
diagonal of $W_4$ is $w(i, i) = \zeta_i - x_i$ and the biases are $b_4 = x$, where $x$ is the
flattened vector of the original input image. In the multiform case, the weight matrix
$W_{4,\zeta}$ connects the neurons in the bottom part, which preserve the information of the
input $\zeta$, to the fourth layer. The first half of $W_{4,\zeta}$ is identical to $W_4$, and
the second half of $W_{4,\zeta}$ has its diagonal set to -1. It provides the value of the
coloring function $\zeta$ with any sign for each occluded pixel. The output of the $j$-th neuron
in the $i$-th group of the fourth layer is the raw pixel value plus $\zeta$ if the pixel at
$(i, j)$ is occluded; otherwise, it is the raw pixel value of $p$.


An Illustrative Example. We show an example of constructing the occlusion network
on a 2 × 2, single-channel image in Fig. 6. In this example, we assume that the input
image is $x = [0.4, 0.6, 0.55, 0.72]$ and the occlusion applied to $x$ has a size of 1 × 1,
which means $w = 1$ and $h = 1$. For the uniform occlusion, the coloring function $\zeta$
has a fixed value of 0, and for the multiform case, the threshold $\epsilon$ by which a pixel
can be altered is 0.1.

We suppose the occlusion is applied at position (1, 2), which means $a = 1$ and $b = 2$ for
the input of the occlusion network. In the forward propagation, we calculate the output of
the first layer by $a \times W_{1,1} + b_{1,1}$ and
$a \times W_{1,2} + w \times W_{1,3} + b_{1,2}$ and get (0, 0, 0, 1) for the first four
neurons. Following the same process, we get the output of the second four neurons,
(1, 0, 0, 0). After propagation to the second layer, it outputs (1, 0), (0, 1) based on
$W_{2,1}$, $W_{2,2}$ and $b_2$, representing that the second column and the first row of $x$
are under occlusion. Likewise, the third layer outputs (0, 1, 0, 0) based on its weight
matrices and biases, representing that the second pixel in the first row is occluded. After
propagation to the fourth layer, the occlusion network outputs an occluded image
$x' = [0.4, 0, 0.55, 0.72]$ based on $W_4$ and $b_4$. It is identical to the expected occluded
image, where the second pixel is occluded, and other pixels stay unchanged. Suppose we
change $a$ to some real number, for instance, 1.5. After the same propagation, we will get
an output of (0, 0.5, 0, 0.5) in the third layer, representing that the pixels in the second
column are affected by the occlusion by a factor of 0.5. The fourth layer then outputs
[0.4, 0.3, 0.55, 0.36], which is the corresponding occluded image $x'$.

Fig. 6: An example of encoding a one-pixel uniform occlusion as a neural network.
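
The uniform-case propagation in this example can be traced with a few lines of plain Python. This is a hedged sketch rather than the tool's implementation: the weight matrices are folded into scalar expressions, the names `axis_mask` and `occlusion_network_uniform` are hypothetical, and the first input coordinate is taken to index rows, as in the example.

```python
def relu(v):
    return max(0.0, v)


def axis_mask(start, extent, size):
    """Hidden layers 1 and 2 for one axis.

    Entry k-1 tells how far the 1-based index k lies inside
    [start, start + extent - 1]: 1 inside, 0 outside, fractional at the border.
    """
    below = [relu(start - k) for k in range(1, size + 1)]               # 0 iff k >= start
    above = [relu(k + 1 - start - extent) for k in range(1, size + 1)]  # 0 iff k <= start + extent - 1
    return [relu(1.0 - lo - hi) for lo, hi in zip(below, above)]        # weights -1, bias 1


def occlusion_network_uniform(x, a, w, b, h, color):
    """Forward pass for a uniform occlusion; x is a list of rows."""
    rows, cols = len(x), len(x[0])
    row_mask = axis_mask(a, w, rows)
    col_mask = axis_mask(b, h, cols)
    # Layer 3: occlusion status of each pixel, ReLU(row + col - 1).
    factor = [[relu(r + c - 1.0) for c in col_mask] for r in row_mask]
    # Layer 4: diagonal weights (color - x) and bias x.
    return [[factor[i][j] * (color - x[i][j]) + x[i][j] for j in range(cols)]
            for i in range(rows)]


image = [[0.4, 0.6], [0.55, 0.72]]
print(occlusion_network_uniform(image, 1, 1, 2, 1, 0.0))    # integer position (1, 2)
print(occlusion_network_uniform(image, 1.5, 1, 2, 1, 0.0))  # real-number position a = 1.5
```

Up to floating-point rounding, the two calls reproduce the occluded images [0.4, 0, 0.55, 0.72] and [0.4, 0.3, 0.55, 0.36] from the example.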


In the multiform case, as mentioned at the beginning, we suppose the threshold
$\epsilon = 0.1$ and keep all other settings. Then, after the same propagation, the third
layer outputs (0, 1, 0, 0), representing that the second pixel is occluded. The extra
neurons then output (0, 0.1, 0, 0, 0, 0, 0, 0), where the second neuron in the first half is
0.1 and the remaining ones are 0. This indicates both that the second pixel in the first row
is occluded and that it has an $\epsilon$ of 0.1. After propagation to the fourth layer, the
occlusion network outputs $x' = [0.4, 0.7, 0.55, 0.72]$ based on its $W_4$ and $b_4$. As
expected, the second pixel is occluded and increases by 0.1, and other pixels stay unchanged.
For the case of a negative $\epsilon$ of -0.1, the extra neurons output
(0, 0, 0, 0, 0, 0.1, 0, 0). Note that the second neuron in the second half is 0.1 and the
remaining ones are 0, which helps retain the sign of -0.1. The fourth layer then outputs
[0.4, 0.5, 0.55, 0.72], which is the expected occluded image where the second pixel
decreases by 0.1.
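
The sign-preserving role of the extra neurons can be sketched for a single pixel as follows. The exact wiring below is an assumption inferred from the worked example (one ReLU for the positive part, one for the negative part, recombined with weights +1 and -1); it is meant for an occlusion factor of 0 or 1 and offsets of magnitude at most 1. The function name is hypothetical.

```python
def relu(v):
    return max(0.0, v)


def multiform_pixel(x_ij, s_ij, eps_ij):
    """Return (new pixel value, first-half neuron, second-half neuron)."""
    pos = relu(s_ij + eps_ij - 1.0)  # eps if the pixel is occluded and eps > 0, else 0
    neg = relu(s_ij - eps_ij - 1.0)  # |eps| if the pixel is occluded and eps < 0, else 0
    return x_ij + pos - neg, pos, neg


print(multiform_pixel(0.6, 1.0, 0.1))   # occluded pixel, positive offset
print(multiform_pixel(0.6, 1.0, -0.1))  # occluded pixel, negative offset
print(multiform_pixel(0.4, 0.0, 0.1))   # pixel outside the occlusion stays unchanged
```

Since ReLU outputs are non-negative, a negative offset has to travel through a separate neuron and be subtracted afterwards, which is why the example shows 0.1 in the second half for an offset of -0.1.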


4.3 The Correctness of the Encoding 


Given an input image $x$, a rectangle occlusion of size $w \times h$, and a coloring function
$\zeta$, let $O$ be the corresponding occlusion neural network constructed in the approach
above. Let $F$ be the FNN to verify. We concatenate $O$ to $F$ by connecting $O$’s output layer
to $F$’s input layer. The combined network implements the composed function $F \circ O$. The
problem of verifying the occlusion robustness of $F$ on the input image $x$ is reduced to a
regular robustness verification problem of $F \circ O$.


Theorem 1 (Correctness). An FNN $F$ is robust on the input image $x$ with respect
to a rectangle occlusion of size $w \times h$ and a coloring function $\zeta$ if and only if
$\Phi_{F \circ O}((a, w, b, h, \zeta)) = \Phi_F(x)$ for all $1 \le a \le n$ and $1 \le b \le m$.


Theorem 1 means that all the occluded images from $x$ are classified by $F$ to the same
label as $x$, which implies the correctness of our proposed encoding approach. To prove
Theorem 1, it suffices to show that the encoded occlusion neural network represents
all the possible occluded images. In other words, when being perceived as a function,
the network outputs the same occluded image as the occlusion function for the same
occlusion coordinate $(a, b)$, as formalized in the following lemma.

Lemma 1. Given an occlusion function
$\gamma_{\zeta,w\times h} : \mathbb{R}^{m \times n} \times \mathbb{R} \times \mathbb{R} \to \mathbb{R}^{m \times n}$
and an input image $x$, let $O_{\gamma,x} : \mathbb{R}^{4+t} \to \mathbb{R}^{m \times n}$ be the
corresponding occlusion neural network. Then
$\gamma_{\zeta,w\times h}(x, a, b) = O_{\gamma,x}(a, w, b, h, \zeta)$ for all $1 \le a \le n$
and $1 \le b \le m$.


Proof (Sketch). It suffices to prove
$\gamma_{\zeta,w\times h}(x, a, b)_{i,j} = O_{\gamma,x}(a, w, b, h, \zeta)_{i,j}$ for all
$i \in \mathbb{N}_{1,n}$ and $j \in \mathbb{N}_{1,m}$. By Definition 2, we consider the
following two cases:


Case 1: When a pixel $p$ at position $(i, j)$ is fully occluded, we have
$\gamma_{\zeta,w\times h}(x, a, b)_{i,j} = \zeta(x, i, j)$. We need to prove that
$O_{\gamma,x}(a, w, b, h, \zeta)_{i,j} = \zeta(x, i, j)$.


Suppose $p$ is covered by an arbitrary uniform occlusion of size $w_0 \times h_0$ at position
$(a_0, b_0)$. We can observe that for that pixel $p$, $i \ge a_0 \wedge i \le a_0 + w_0 - 1$ and
$j \ge b_0 \wedge j \le b_0 + h_0 - 1$ hold since $p$ is covered by the occlusion.

We show the output of $O_{\gamma,x}(a, w, b, h, \zeta)_{i,j}$ by inspecting the
$(i \times n + j)$-th output of the occlusion network after propagation, starting from the
output of the $i$-th and $(i + m)$-th neurons of the first layer. According to the network
structure discussed in Sec. 4.2, the $i$-th neuron in the first layer is 0 only when
$i \ge a_0$; the same property holds for the $(i + m)$-th neuron when $i \le a_0 + w_0 - 1$.
Therefore, the outputs of the $i$-th and $(i + m)$-th neurons of the first layer are 0, which
leads to the $i$-th neuron in the first part of the second layer having an output of 1.
Through a similar process, we get that the output of the $j$-th neuron in the second part of
the second layer is also 1.

The $(i \times n + j)$-th neuron in the third layer is based on the $i$-th neuron and the
$j$-th neuron of the second layer that we just discussed. Therefore, the output of that
neuron, $z^{(3)}_{i \times n + j}$, is 1. For uniform occlusion, suppose the coloring function
$\zeta$ has a fixed value $\mu_0$. By propagating the output $z^{(3)}_{i \times n + j}$ to the
fourth layer, which is calculated as $W_4 \times z^{(3)} + b_4$, the $(i \times n + j)$-th
output of the fourth layer is $1 \times (\mu_0 - p_{i,j}) + p_{i,j} = \mu_0$. Likewise, for
multiform occlusion, $\zeta$ indicates the threshold $\epsilon$ by which a pixel can change.
The $(i \times n + j)$-th extra neuron outputs $\epsilon_0$, and the corresponding neuron in
the fourth layer then outputs $p_{i,j} + \epsilon_0$.

This output of $O_{\gamma,x}(a, w, b, h, \zeta)_{i,j}$ is identical to
$\gamma_{\zeta,w\times h}(x, a, b)_{i,j}$, the expected pixel value at position $(i, j)$, which
also indicates that the color is correctly encoded.


Case 2: When a pixel $p$ at position $(i, j)$ is not occluded, we have
$\gamma_{\zeta,w\times h}(x, a, b)_{i,j} = x_{i,j}$. Then, we need to prove that
$O_{\gamma,x}(a, w, b, h, \zeta)_{i,j} = x_{i,j}$.


In this case, we can observe that $i < a_0 \vee i \ge a_0 + w_0$ or
$j < b_0 \vee j \ge b_0 + h_0$ holds for pixel $p$, since a pixel outside the occluded range
along either axis is not occluded. Then we can tell that the corresponding neuron in the third
layer outputs 0 and the output of the $(i \times n + j)$-th neuron in the fourth layer is the
original pixel value of $p$, following a process similar to the one discussed in Case 1.


For an occlusion at a real number position, some more cases need to be discussed,
but the proof follows a very similar sketch to that of the normal occlusion at an integer
position. We leverage the equality $a \times b = \exp(\log(a) + \log(b))$ and add it to the
propagation between the third layer and those extra neurons only when the occlusion is at
real number positions in the multiform case. In the implementation, we use
$\mathrm{ReLU}(a + b - 1)$ as an alternative to logarithms and exponents since Marabou does not
support such operations. Due to the page limit, please refer to [15] for the details of the
full proof.
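
How closely $\mathrm{ReLU}(a + b - 1)$ tracks the product can be checked numerically. The sketch below restricts both factors to $[0, 1]$, which is an assumption of this illustration and not a statement from the proof; the name `relu_and` is hypothetical.

```python
def relu_and(a, b):
    """Piece-wise linear stand-in for the product a * b."""
    return max(0.0, a + b - 1.0)


grid = [k / 10 for k in range(11)]

# Exact (up to rounding) whenever one factor is 0 or 1, e.g. a fully occluded pixel.
exact_on_boundary = all(abs(relu_and(a, b) - a * b) < 1e-12
                        for a in (0.0, 1.0) for b in grid)

# Elsewhere it never exceeds the product, because a*b - (a + b - 1) = (1 - a)(1 - b) >= 0.
lower_bound = all(relu_and(a, b) <= a * b + 1e-12 for a in grid for b in grid)

print(exact_on_boundary, lower_bound)
```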

Theorem 1 can be directly derived from Lemma 1 and Definition 3 by substituting
$\gamma_{\zeta,w\times h}(x, a, b)$ for $O_{\gamma,x}(a, w, b, h, \zeta)$ in the definition.


4.4 Verification Acceleration Techniques


Existing SMT-based neural network verification tools can directly verify the composed 
neural network. The number of ReLU activation functions in the network is the primary 
factor in determining the verification time cost by the backend tools. In the occlusion 
part, the number of ReLU nodes is independent of the scale of the original networks to 
be verified. Therefore, our approach’s scalability relies only on the underlying tools. 

To further improve the verification efficiency, we integrate two algorithmic accelera- 
tion techniques by dividing the verification problem into small independent sub-problems 
that can be solved separately. 


Occlusion Space Splitting. We observed that verifying the composed neural network 
with a large input space can significantly degrade the efficiency of backend verifiers. 
Even for small FNNs with only tens of ReLUs, the verifiers may run out of time due to 
the large occlusion space for searching. For instance, the complexity of Reluplex [20]
can be derived from the original SMT method of Simplex [32]. It has a complexity of
$\Omega(v \times m \times n)$, where $m$ and $n$ here represent the number of constraints and
variables, and $v$ represents the number of pivot operations in the Simplex method. In the
worst case, $v$
can grow exponentially. Reduction in the search space can reduce the number of pivot 
operations, therefore significantly improving verification efficiency. 

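Based on the above observation, we can divide $[1, m]$ (resp. $[1, n]$) into
$k_m \in \mathbb{Z}^+$ (resp. $k_n \in \mathbb{Z}^+$) intervals
$[m_0, m_1], \ldots, [m_{k_m - 1}, m_{k_m}]$ (resp. $[n_0, n_1], \ldots, [n_{k_n - 1}, n_{k_n}]$)
and verify the problem on the Cartesian product of the two sets of intervals:


$$\forall x' \in X.\ \Phi(x') = \Phi(x) \ \equiv\ \bigwedge_{i=0,\ j=0}^{k_m - 1,\ k_n - 1} \forall x' \in X_{i,j}.\ \Phi(x') = \Phi(x), \text{ where}$$

$$X = \bigcup_{i=0,\ j=0}^{k_m - 1,\ k_n - 1} X_{i,j}, \quad X_{i,j} = \{\gamma_{\zeta,w\times h}(x, a, b) \mid m_i \le a \le m_{i+1},\ n_j \le b \le n_{j+1}\}. \qquad (7)$$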


In this way, we split the occlusion space into $k_m \times k_n$ sub-spaces. It is equivalent
to prove $\forall x' \in X_{i,j}.\ \Phi(x') = \Phi(x)$ for all $X_{i,j}$ with $0 \le i < k_m$
and $0 \le j < k_n$, without losing soundness and completeness. We call each verification
instance a query, which can be solved more efficiently by backend verifiers than the one on
the whole occlusion space. Furthermore, another advantage of occlusion space splitting is
that these divided queries can be solved in parallel by leveraging multi-threaded computing.
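
A minimal Python sketch of the splitting step is shown below. It assumes equal-width intervals (the text does not prescribe how the interval bounds are chosen) and a hypothetical callback `verify_query(a_range, b_range)` that stands in for a call to the backend verifier and returns a counterexample or `None`.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def split_interval(lo, hi, k):
    """Split [lo, hi] into k adjacent closed intervals of equal width."""
    points = [lo + (hi - lo) * t / k for t in range(k + 1)]
    points[-1] = hi  # avoid floating-point drift on the last bound
    return list(zip(points[:-1], points[1:]))


def split_occlusion_space(m, n, k_m, k_n):
    """Cartesian product of the interval sets over [1, m] and [1, n]."""
    return list(product(split_interval(1, m, k_m), split_interval(1, n, k_n)))


def verify_all(boxes, verify_query, workers=4):
    """Run one query per sub-space in parallel; return a counterexample or None."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda box: verify_query(*box), boxes))
    return next((r for r in results if r is not None), None)


boxes = split_occlusion_space(28, 28, 3, 3)
print(len(boxes), boxes[0], boxes[-1])
```

Because the sub-spaces together cover the whole occlusion space, the network is robust exactly when every query reports no counterexample.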


Eager Falsification by Label Sorting. Another Divide & Conquer approach for ac- 
celeration is to divide the verification problem into independent sub-problems by the 
classification labels in L, as defined below: 


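$$\forall x' \in X.\ \Phi(x') = \Phi(x) \ \equiv\ \forall x' \in X.\ \bigwedge_{\ell' \in L} \big(\Phi(x) = \ell' \vee \Phi(x') \ne \ell'\big). \qquad (8)$$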


The dual problem of disproving the robustness can be solved by finding some label $\ell'$
such that $\Phi(x) \ne \ell' \wedge \Phi(x') = \ell'$. We can first solve the sub-problems that
have higher probabilities of being non-robust. Once a sub-problem is proved non-robust, the
verification terminates, with no need to solve the remainder. Such an approach is called eager
falsification [14]. Based on this methodology, we sort the sub-problems in descending order
according to the probabilities at which the original image is classified to the corresponding
labels by the neural network. A higher probability implies that the image is more likely to be
classified to the corresponding label. Heuristically, there is a higher probability of finding
an occlusion such that the occluded image is misclassified to that label. We dispatch the
queries to backend verifiers in sequence until all are verified, or a non-robust case is
reported. Our experimental results will show that this approach can achieve up to 8 and 24
times speedup in the robust and non-robust cases, respectively.

Table 1: Occlusion verification results on two medium FNNs trained on MNIST and
GTSRB in different occlusion sizes 2 × 2 and 5 × 5 and occlusion radius ε.

       |      | Medium FNN (600 ReLUs) on MNIST           | Medium FNN (343 ReLUs) on GTSRB
Size   | ε    | -/+    T+      T-     T_build  TO(%)      | -/+    T+      T-     T_build  TO(%)
2 × 2  | 0.05 | 2/28   120.01  11.98  0.068    0.00       | 8/13   103.64  24.18  0.089    0.00
2 × 2  | 0.10 | 3/27   121.37  19.18  0.067    0.00       | 8/13   108.62  22.57  0.088    0.00
2 × 2  | 0.20 | 4/26   122.12  39.57  0.067    0.00       | 10/11  113.7   23.17  0.084    0.00
2 × 2  | 0.30 | 6/24   165.98  45.6   0.086    0.00       | 11/10  117.97  26.41  0.089    0.00
2 × 2  | 0.40 | 7/23   183.65  47.32  0.098    4.75       | 14/7   115.49  31.53  0.096    0.14
5 × 5  | 0.05 | 5/25   123.45  49.04  0.065    0.00       | 9/12   123.99  26.02  0.101    0.00
5 × 5  | 0.10 | 6/24   124.13  44.09  0.073    0.00       | 12/9   127.65  26.96  0.01     0.00
5 × 5  | 0.20 | 10/20  179.89  52.51  0.073    3.26       | 16/5   126.98  27.22  0.102    0.00
5 × 5  | 0.30 | 14/16  284.67  65.98  0.076    5.45       | 18/3   146.68  29.11  0.100    0.04
5 × 5  | 0.40 | 22/8   339.78  97.28  0.074    7.33       | 19/2   169.17  26.52  0.103    0.09

* -/+: the numbers of non-robust and robust cases; T+ (resp. T-): average verification time in
robust (resp. non-robust) cases; T_build: the building time of occlusion neural networks; TO
(%): the percentage of timed-out cases among all the queries.
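
The label-sorting loop can be sketched in Python as follows. The callback `verify_label(label)` is hypothetical: it stands in for one backend query that looks for an occluded image classified to `label` and returns it, or `None` if no such image exists. The score dictionary is likewise an assumption of this sketch.

```python
def eager_falsification(scores, verify_label):
    """Check target labels in descending order of the original image's scores.

    scores: dict mapping each label to the network's score on the original image.
    Returns (label, counterexample) for the first falsified label, or None if
    every query reports that no counterexample exists.
    """
    predicted = max(scores, key=scores.get)
    candidates = sorted((label for label in scores if label != predicted),
                        key=scores.get, reverse=True)
    for label in candidates:
        counterexample = verify_label(label)
        if counterexample is not None:
            return label, counterexample  # stop early: remaining queries are skipped
    return None


scores = {"70km/h": 0.7, "30km/h": 0.2, "stop": 0.1}
print(eager_falsification(scores, lambda label: "x'" if label == "30km/h" else None))
```

In the robust case all queries still have to be solved, so the early exit mainly benefits non-robust inputs.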


5 Implementation and Evaluation 


We implemented our approach in a Python tool called OccRob, using the PyTorch
framework. As a backend tool, we chose Marabou [21], a state-of-the-art SMT-based
DNN verifier. We evaluated our proposed approach extensively on a suite of benchmark
datasets, including MNIST [24] and GTSRB [16]. The size of the networks trained 
on the datasets for verification is measured by the number of ReLUs, ranging from 
70 to 1300. All the experiments are conducted on a workstation equipped with a 32- 
core AMD Ryzen Threadripper CPU @ 3.7GHz and 128 GB RAM and Ubuntu 18.04. 
We set a timeout threshold of 60 seconds for a single verification task. All code and 
experimental data, including the models and verification scripts can be accessed at 
https://github.com/MakiseGuo/OccRob. 

We evaluate our proposed method concerning efficiency and scalability in the occlu- 
sion robustness verification of ReLU-based FNNs. Our goals are threefold: 


1. To demonstrate the effectiveness of the proposed approach for the robustness verifi- 
cation against various types of occlusion perturbations. 

2. To evaluate the efficiency improvement of the proposed approach, compared with 
the naive SMT-based method. 

3. To demonstrate the effectiveness of the acceleration techniques in efficiency im- 
provement. 


Experiment I: Effectiveness. We first evaluate the effectiveness of OccRob in robustness
verification against various types of occlusions of different sizes and color ranges. Table 1
shows the verification results and time costs against multiform occlusions on two medium
FNNs trained on MNIST and GTSRB. We consider two occlusion sizes, 2 × 2 and 5 × 5,
respectively. The occluding color range is from 0.05 to 0.40. In each verification task,
we selected the first 30 images from each of the two datasets and verified the network’s
robustness around them, under corresponding occlusion settings. As expected, larger
occlusion sizes and occluding color ranges imply more non-robust cases. One can see
that OccRob can almost always verify or falsify each input image, except for a few
time-outs. The robust cases cost more time than the non-robust cases, but all can be
finished in a few minutes. Note that the time overhead for building occlusion neural
networks is almost negligible, compared with the verification time. The effectiveness
against uniform occlusions is shown in the following experiment.

Fig. 7 shows several occlusive adversarial examples that are generated by OccRob
under different occlusion settings. These occlusions do not alter the semantics of the
original images, so the occluded images should be classified to the same results as the
non-occluded ones. However, they are misclassified to other results.

Fig. 7: Occlusive adversarial examples automatically generated for non-robust images.


Experiment II: Efficiency improvement over the naive encoding method. We compare
the efficiency of OccRob with that of a naive SMT encoding approach on verifying
uniform occlusions since the naive encoding approach cannot handle verification against
multiform occlusions. We apply the same acceleration techniques, such as parallelization
and a variant of input space splitting, to the naive approach, which otherwise times out
for almost all verification tasks even on the smallest model.

Table 2 shows the average verification time on six FNNs of different sizes against
uniform occlusions. We can observe that OccRob affords a significant improvement in
efficiency, running up to 30 times faster than the naive approach. It can always finish
before the preset time threshold, while the naive method fails to verify the two large
networks under the same time threshold. The timeout proportion of the naive method on the
two medium networks is over 70%. While the small network on MNIST has a timeout proportion
of only 8% with the naive method, OccRob rarely times out on any network.


Table 2: Performance comparison between OccRob (OR) and the naive (NAI) methods
on MNIST and GTSRB under different occlusion sizes.

        | MNIST                                            | GTSRB
        | Small FNN      | Medium FNN      | Large FNN     | Small FNN      | Medium FNN     | Large FNN
Size    | OR     NAI     | OR      NAI     | OR      NAI   | OR     NAI     | OR     NAI     | OR      NAI
1 × 1   | 46.44  63.12   | 110.18  759.93  | 206.50  TO    | 29.76  472.23  | 69.28  989.08  | 173.62  TO
2 × 2   | 49.62  165.53  | 98.60   832.98  | 199.17  TO    | 21.04  340.89  | 42.16  680.81  | 103.42  TO
3 × 3   | 51.23  298.59  | 111.14  863.74  | 205.67  TO    | 11.93  169.35  | 32.00  499.31  | 81.17   TO
4 × 4   | 44.78  256.22  | 115.99  886.73  | 225.02  TO    | 8.90   141.85  | 31.24  419.62  | 106.41  TO
5 × 5   | 48.96  270.23  | 113.01  803.40  | 264.79  TO    | 6.11   190.81  | 27.97  418.56  | 118.99  TO
6 × 6   | 47.81  318.28  | 127.98  642.01  | 288.18  TO    | 7.49   213.35  | 21.70  282.04  | 60.02   TO
7 × 7   | 34.99  357.78  | 124.47  589.41  | 222.65  TO    | 6.02   153.81  | 31.96  404.18  | 62.60   TO
8 × 8   | 36.05  324.34  | 129.27  469.24  | 215.53  TO    | 5.99   123.07  | 28.44  250.97  | 54.37   TO
9 × 9   | 34.58  224.01  | 141.54  375.97  | 219.61  TO    | 6.42   102.39  | 31.30  160.84  | 59.87   TO
10 × 10 | 28.98  178.44  | 78.89   398.01  | 182.36  TO    | 6.61   127.20  | 28.59  153.96  | 40.69   TO
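
The speedup figure can be recomputed from the table. The short Python snippet below copies the small-FNN GTSRB columns of Table 2 and divides the naive time by the OccRob time for each occlusion size; the variable names are illustrative only.

```python
# (OccRob, naive) average verification times, small FNN on GTSRB, from Table 2.
gtsrb_small = {
    "1x1": (29.76, 472.23), "2x2": (21.04, 340.89), "3x3": (11.93, 169.35),
    "4x4": (8.90, 141.85), "5x5": (6.11, 190.81), "6x6": (7.49, 213.35),
    "7x7": (6.02, 153.81), "8x8": (5.99, 123.07), "9x9": (6.42, 102.39),
    "10x10": (6.61, 127.20),
}

# Speedup of OccRob over the naive encoding for each occlusion size.
speedup = {size: naive / occrob for size, (occrob, naive) in gtsrb_small.items()}
best = max(speedup, key=speedup.get)
print(best, round(speedup[best], 1))
```

On these columns the largest ratio occurs for the 5 × 5 occlusion and is slightly above 30.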


Experiment III: Effectiveness of the integrated acceleration techniques. We finally 
evaluate the effectiveness of the two acceleration techniques integrated with the tool. 
We evaluate each technique separately by excluding it from OccRos and comparing the 
verification time of OccRos and the corresponding excluded versions. Fig. 8 shows the 
experimental results of verifying the medium FNN trained on GTSRB against multiform 
occlusions by the tools. Fig. 8 (a) shows that label sorting can improve efficiency in both 
robust and non-robust cases. In particular, the improvement is more significant in the 
non-robust case, with up to 5 times speedup in the experiment. That is because solving 
each query is faster than solving all simultaneously, and further OccRos immediately 
stops dispatching queries once a counterexample is found in the non-robust case. Fig. 8 
(b) shows that occlusion space splitting can also significantly improve the efficiency, 
with up to 8 and 24 times speedups in the robust and non-robust cases, respectively. In 
addition, Fig. 8 (b) also shows a significant reduction in the number of time-outs. 


6 Related Work 


Robustness verification of neural networks has been extensively studied recently, aiming 
at devising efficient methods for verifying neural networks’ robustness against various 
types of perturbations and adversarial attacks. We classify those methods into two 
categories according to the type of perturbations, which can be semantic or non-semantic. 
Semantic perturbation has an interpretable meaning, such as occlusions and geometric 
transformations like rotation, while non-semantic perturbation means that noises perturb 
inputs with no particular meanings. 

Non-semantic perturbations are usually represented as $L_p$ norms, which define the
ranges in which an input can be altered. Some robustness verification approaches for 
Fig. 8: Efficiency evaluation results of the two acceleration techniques.


non-semantic perturbations are both sound and complete by leveraging SMT [20,1] and 
MILP (mixed integer linear programming) [36] techniques, while some sacrifice the 
completeness for better scalability by over-approximation [29,2,7], abstract interpretation 
[34,10,5], interval analysis by symbolic propagation [43,42,26], etc. 

In contrast to a large number of works on non-semantic robustness verification, there 
are only a few studies on the semantic case. Because semantic perturbations are beyond 
the range of $L_p$ norms [9], those abstraction-based approaches cannot be directly applied
to verifying semantic perturbations. Mohapatra et al. [30] proposed to verify neural 
networks against semantic perturbations by encoding them into neural networks. Their 
encoding approach is general to a family of semantic perturbations such as brightness 
and contrast changes and rotations. Their approach for verifying occlusions is restricted 
to uniform occlusions at integer locations. Sallami et al. [31] proposed an interval-based
method to verify the robustness against the occlusion perturbation problem under the 
same restriction. Singh et al. [35] proposed a new abstract domain to encode both 
non-semantic and semantic perturbations such as rotations. Chiang et al. [4] called 
occlusions adversarial patches and proposed a certifiable defense by extending interval 
bound propagation (IBP) [12]. Compared with these existing verification approaches 
for semantic perturbations, our SMT-based approach is both sound and complete, and 
meanwhile, it supports a larger class of occlusion perturbations. 


7 Conclusion and Future Work 


We introduced an SMT-based approach for verifying the robustness of deep neural net- 
works against various types of occlusions. An efficient encoding method was proposed to 
represent occlusions using neural networks, by which we reduced the occlusion robust- 
ness verification problem to a regular robustness verification problem of neural networks 
and leveraged off-the-shelf SMT-based verifiers for the verification. We implemented 
a resulting prototype OccRos and intensively evaluated its effectiveness and efficiency 
on a series of neural networks trained on the public benchmarks, including MNIST and 
GTSRB. Moreover, as the scalability of DNN verification engines continues to improve, 
our approach, which uses them as blackbox backends, will also become more scalable. 
As our occlusion encoding approach is independent of target neural networks, we
believe it can be easily extended to other complex network structures, such as convo- 
lutional and recurrent ones, which only depend on the backend verifiers. It would also 
be interesting to investigate how the generated adversarial examples could be used for 
neural network repairing [41,18] to train more robust networks. 
Abstract. Kobayashi et al. have recently proposed NEUGUS, a framework
of neural-network-guided synthesis of logical formulas or simple
program fragments, where a neural network is first trained based on
sample data, and then a logical formula over integers is constructed by
using the weights and biases of the trained network as hints. The previous
method was, however, restricted to the class of formulas of quantifier-free
linear integer arithmetic. In this paper, we propose a NEUGUS method
for the synthesis of recursive predicates over lists definable by using the 
left fold function. To this end, we design and train a special-purpose re- 
current neural network (RNN), and use the weights of the trained RNN 
to synthesize a recursive predicate. We have implemented the proposed 
method and conducted preliminary experiments to confirm the effective- 
ness of the method. 


1 Introduction 


Kobayashi et al. [12] have recently proposed a framework called Neural-Network- 
Guided Synthesis (NEUGUS) for the synthesis of quantifier-free logical expres- 
sions over integer variables, which may also be viewed as simple program expres- 
sions over integer variables. Given sample data (also called training data below, 
which consist of positive/negative samples and implication constraints [6] such
as “if $d_1$ is a positive sample, so is $d_2$, but it is unknown whether $d_1$ is indeed
a positive sample”), NEUGUS first trains a feed-forward neural network with respect
to the sample data, and then constructs a logical expression on integers
(more precisely, a Boolean combination of inequalities on integer variables) by 
using the weights and biases of the neural network as hints. The main character- 
istic of NEUGUS is its gray-box use of neural networks. NEUGUS first trains a 
neural network, but instead of directly using the trained network as a classifier, 
it tries to construct a simple logical expression by using the trained network 
as a hint. Advantages of the gray-box approach over the white-box approach of 
using the network itself as a classifier include: (i) if successful, a simple classi- 
fier is obtained that is easier to understand (for human beings) and verify (for 
computers), and (ii) we need not worry too much about overfitting; even if the 
trained network is overfit to the given sample data, we may still be able to ex- 
tract useful information such as features important for the classification, and 
use them to construct a simple classifier. Kobayashi et al. [12,13] have applied 
the proposed framework to automated program verification, where NEUGUS is 
used to find program invariants, and also to program synthesis where, given a
program sketch containing holes called oracles, NEUGUS is used to find program 
expressions to fill the holes. 

In this paper, we extend NEUGUS to enable the synthesis of recursive predicates
over Booleans, integers, lists of Booleans, and lists of integers from
positive/negative samples and implication constraints. For example, in the case
of the synthesis of a sortedness predicate, the extended NEUGUS (henceforth,
simply called NEUGUSR) takes as inputs sample data like:


sorted([1;3;4])    sorted([2;5;6;7])    ¬sorted([3;1;4])    ¬sorted([5;2;7;6])
sorted([1;3;5]) ⇒ sorted([1;3;5;6])


Here, sorted([1;3;5]) ⇒ sorted([1;3;5;6]) means that if sorted([1;3;5]) is
true, so is sorted([1;3;5;6]). The goal of the synthesis is to construct a recursive
program that satisfies the constraints specified by the sample data. In the case of
the above example, we aim to construct a program (written in OCaml:
https://ocaml.org/) like:


let sorted l =
  let rec sorted_aux l b r =
    match l with [] -> b
    | x::l' -> sorted_aux l' (b && r <= x) x
  in sorted_aux l true 0


Here, the Boolean argument b of the auxiliary function sorted_aux denotes 
whether the elements of the list read so far are sorted (in the ascending order), 
and the integer argument r keeps the last element read (which is initially set 
to 0; hence, the function sorted judges the sortedness of a list consisting of 
non-negative integers), to compare it with the next element. The recursive pro- 
grams constructed with our method are restricted to those definable by using 
the left fold function. Note that the function sorted above can be expressed as 
$\mathit{foldl}\ (\lambda (b, r).\lambda x.(b \land r \le x,\ x))\ (\mathit{true}, 0)$ using the left fold function foldl.¹
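The fold-based reading of sorted can be checked directly in OCaml. The sketch below is only an illustration: it uses the standard List.fold_left in place of foldl, and the name step_sorted is introduced here for the folded function. Since the fold returns the pair (b, r), the predicate is its first component.

```ocaml
(* Sortedness of a list of non-negative integers as a left fold.
   The accumulator (b, r) mirrors the arguments of sorted_aux:
   b = "the prefix read so far is sorted", r = last element read. *)
let step_sorted (b, r) x = (b && r <= x, x)

(* The fold yields a pair; the predicate is its first component. *)
let sorted l = fst (List.fold_left step_sorted (true, 0) l)
```

For instance, sorted [1; 3; 4] evaluates to true and sorted [3; 1; 4] to false, in agreement with the sample data above.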

To synthesize recursive predicates, we first train a recurrent neural network 
(RNN), and construct a recursive program like the one above by using, as hints, 
the weights of the RNN and information about the executions of the RNN for 
the training data. We have designed a special-purpose RNN for that purpose, 
with the synthesis of recursive programs in mind. Figure 1 shows the overall 
structure of our RNN. The RNN has two kinds of inputs: Boolean lists and 
integer lists (where their elements are read one by one), and a Boolean output. 
The inputs and output correspond to those of the function to be synthesized, 
which takes m Boolean lists and n integer lists as arguments, and returns a 
Boolean value. Here, we assume that the lists are of equal length, by replicating 
integer arguments and padding short lists with dummy elements if necessary. For 


1 In fact, the program above is written so that it matches the computation of the 
left fold function. Otherwise, sorted_aux could alternatively be defined so that it 
returns false immediately when r > x holds. 
Fig. 1. The overall structure of the special-purpose RNN


example, if the argument of the function to be synthesized is ([1;2;3], 0, [1;0]),
then the input for the RNN will be ([1;2;3], [0;0;0], [1;0;-1]). The Boolean values
true and false are respectively represented as 1 and -1. The RNN also has
two kinds of hidden states: Booleans and integers. The Boolean hidden states are
actually represented as numerical values, but they are constrained to range over
[-1, 1] by using the hyperbolic tangent function tanh as the activation function
for those values inside the feed-forward network. The details of the feed-forward
network will be discussed later.

After training the RNN, by using (i) the weights and biases of each link/node 
and (ii) the input/output behavior of the trained feed-forward network as
hints, we construct a function: 


\[ \mathit{step} : B^m \times Z^n \times B^h \times Z^k \to B^h \times Z^k, \]


which takes the current input (consisting of m Booleans and n integers) and the 
current values of Boolean and integer hidden states, and returns the next hidden 
states. Here, Z and B are the types of integers and Booleans respectively. We 
then construct the whole program as the one that “folds” the input lists by using 
the step function, where the base-case values correspond to the initial values of 
the hidden states; more details are discussed in later sections. Finally, we check 
whether the synthesized program conforms to the sample data and if so, output 
the program; otherwise we retrain the RNN and retry the program synthesis. 
We have implemented a program synthesis tool based on the above idea. 
We have confirmed through experiments that the tool worked reasonably well; 
our tool could successfully synthesize the sortedness predicate above, as well 
as other non-trivial predicates, including the binary predicate $\mathit{avge}(\ell, n)$, which
means that the average value of the elements in the list $\ell$ is no less than $n$.
The rest of this paper is structured as follows. Section 2 defines the program 
synthesis problem considered in this paper. Section 3 introduces our special- 
purpose RNN. Section 4 explains how to synthesize a program from a trained 


230 N. Kobayashi and M. Wu 


RNN. Section 5 reports an implementation and experimental results. Section 6 
discusses related work and Section 7 concludes the paper. 


2 The Synthesis Problem 


This section defines the problem of program synthesis considered in this paper. 
We write $B$ and $Z$ for the sets of Booleans and integers respectively. For a
set $S$, we write $S^*$ for the set of sequences consisting of elements of $S$, and
$S_1 \times \cdots \times S_p$ for the set of tuples of the form $(v_1, \ldots, v_p)$ with $v_i \in S_i$ for each
$i$. We sometimes call an element of $S^*$ a list, based on the terminology used in
programming languages, and write $[a_1; \cdots; a_n]$ instead of $a_1 \cdots a_n$.

We assume a finite set of variables called predicate variables. A signature 
maps each predicate variable to its domain of the form $T_1 \times \cdots \times T_k$, where
$T_i \in \{B, Z, B^*, Z^*\}$. For example, for a signature $K$ and a predicate variable $p$,
$K(p) = Z^* \times Z$ means that $p$ is a binary predicate that takes an integer list and
an integer as arguments.

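For a signature $K$, we write $\mathit{Atoms}_K$ for the set of pairs $(p, v)$ where $v \in K(p)$;
we often write $p(v)$ for $(p, v)$. An implication constraint is a formula of the form
$a_1 \land \cdots \land a_k \Rightarrow b_1 \lor \cdots \lor b_\ell$, where $a_1, \ldots, a_k, b_1, \ldots, b_\ell \in \mathit{Atoms}_K$. Let $\Theta$ be
an interpretation for predicate variables, i.e., a map that assigns a predicate
$P \subseteq K(p)$ to each predicate variable $p \in \mathit{dom}(K)$. We write $\Theta \models p(v)$ if $v \in \Theta(p)$. We
write $\Theta \models a_1 \land \cdots \land a_k \Rightarrow b_1 \lor \cdots \lor b_\ell$ and say that $\Theta$ satisfies the implication
constraint $a_1 \land \cdots \land a_k \Rightarrow b_1 \lor \cdots \lor b_\ell$ when $\Theta \models b_j$ for some $j \in \{1, \ldots, \ell\}$ if
$\Theta \models a_i$ for every $i \in \{1, \ldots, k\}$.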
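To make the satisfaction relation concrete, the following OCaml sketch checks an implication constraint against an interpretation. It is a simplification under stated assumptions: atoms are restricted to predicates over a single integer list, and the names atom, constr, holds, satisfies, and is_sorted are hypothetical, introduced only for this illustration.

```ocaml
(* An atom p(v): a predicate name and an integer-list argument.
   (Simplifying assumption: general domains T_1 x ... x T_k are not modelled.) *)
type atom = string * int list

(* a_1 /\ ... /\ a_k => b_1 \/ ... \/ b_l *)
type constr = { body : atom list; head : atom list }

(* An interpretation assigns a Boolean function to each predicate name. *)
let holds (interp : string -> int list -> bool) ((p, v) : atom) = interp p v

(* The constraint is satisfied if some body atom fails or some head atom holds. *)
let satisfies interp c =
  (not (List.for_all (holds interp) c.body))
  || List.exists (holds interp) c.head

(* A reference interpretation used only for the usage example. *)
let rec is_sorted = function
  | x :: (y :: _ as rest) -> x <= y && is_sorted rest
  | _ -> true
```

In this encoding, a positive sample is a constraint with an empty body, and a negative sample is a constraint with an empty head.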

The synthesis problem considered in this paper is the problem of, given a 
signature K and a set of implication constraints as input, finding (a description 
of) a predicate assignment O that satisfies all the implication constraints. As a 
description of the predicate assigned to each predicate variable, we consider the 
class of functions f defined by programs of the following form: 


let f(x̃ : T_1 × ··· × T_k) =
  let rec g(y, r̃) = match y with
      [] -> r_1
    | (u_1, ..., u_k) :: y' -> let r̃' = step(u_1, ..., u_k, r̃) in g(y', r̃')
  in g(ezip_{T_1 × ··· × T_k}(x̃), d̃).


Here, $\tilde{x}$ denotes a sequence $x_1, \ldots, x_k$, and $\tilde{d}$ denotes a sequence of default integer
or Boolean values $d_1, \ldots, d_\ell$, where $d_i$ is true or 0; we write $d_B$ for true and
$d_Z$ for 0.² The function ezip is an extended “zip” function, which maps a tuple


2 The use of fixed default values slightly restricts the class of functions. In fact, the
value of f([]) is restricted to true. To remove the restriction, it suffices to either (i)
allow $\tilde{d}$ to take other values and make them also learnable, or (ii) replace $r_1$ with
$h(r_1)$ and make the Boolean function $h$ also learnable.
consisting of lists, integers, and Booleans to a list of tuples. It is defined by:


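\[
\mathit{ezip}_{T_1 \times \cdots \times T_k}(v_1, \ldots, v_k) =
\begin{cases}
[\,] & \text{if every } v_i \text{ is } [\,] \text{, an integer, or a Boolean} \\
(\mathit{ehd}_{T_1}(v_1), \ldots, \mathit{ehd}_{T_k}(v_k)) :: \mathit{ezip}_{T_1 \times \cdots \times T_k}(\mathit{etl}_{T_1}(v_1), \ldots, \mathit{etl}_{T_k}(v_k)) & \text{otherwise}
\end{cases}
\]

\[
\begin{array}{llll}
\mathit{ehd}_{Z^*}([\,]) = -1 & \mathit{ehd}_{Z^*}(n :: v) = n & \mathit{ehd}_{B^*}([\,]) = \mathit{false} & \mathit{ehd}_{B^*}(b :: v) = b \\
\mathit{ehd}_{Z}(n) = n & \mathit{ehd}_{B}(b) = b & \mathit{etl}_{Z}(n) = n & \mathit{etl}_{B}(b) = b \\
\mathit{etl}_{Z^*}([\,]) = [\,] & \mathit{etl}_{Z^*}(n :: v) = v & \mathit{etl}_{B^*}([\,]) = [\,] & \mathit{etl}_{B^*}(b :: v) = v.
\end{array}
\]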


For example, $\mathit{ezip}_{Z^* \times Z^* \times Z}([1;2;3], [2;3], 1) = [(1,2,1); (2,3,1); (3,-1,1)]$. The
function step is the main target of the synthesis. It should be a function on
integers and Booleans, consisting of (i) Boolean operations, (ii) affine expressions
of the form $c_0 + c_1 x_1 + \cdots + c_k x_k$, and (iii) inequalities of the form $e < 0$, where
$e$ is an affine expression. The function $f$ above can also be expressed as


\[ \lambda \tilde{x}.\ \#_1(\mathit{foldl}\ \mathit{step}'\ (\tilde{d})\ (\mathit{ezip}_{T_1 \times \cdots \times T_k}(\tilde{x}))), \]


where foldl is the left fold function, step’ is the curried version of step, and #1 
denotes the projection of a tuple to its first element. 
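The extended zip can be sketched in OCaml as follows. This is an illustration under assumptions, not the authors' implementation: the variant types arg and elem are hypothetical wrappers introduced here so that arguments of different types fit in one list, and each output tuple is represented as a list of tagged elements.

```ocaml
(* Arguments of the predicate: scalars or lists of integers/Booleans. *)
type arg = Int of int | Bool of bool | IntList of int list | BoolList of bool list

(* One component of an output tuple. *)
type elem = I of int | B of bool

(* True for [], integers, and Booleans: nothing is left to read. *)
let is_done = function
  | IntList (_ :: _) | BoolList (_ :: _) -> false
  | _ -> true

(* Extended head: scalars are replicated, empty lists yield dummy values. *)
let ehd = function
  | Int n -> I n
  | Bool b -> B b
  | IntList [] -> I (-1)
  | IntList (n :: _) -> I n
  | BoolList [] -> B false
  | BoolList (b :: _) -> B b

(* Extended tail: scalars and empty lists are left unchanged. *)
let etl = function
  | IntList (_ :: v) -> IntList v
  | BoolList (_ :: v) -> BoolList v
  | a -> a

let rec ezip args =
  if List.for_all is_done args then []
  else List.map ehd args :: ezip (List.map etl args)
```

Applied to [IntList [1;2;3]; IntList [2;3]; Int 1], this yields the three tuples (1,2,1), (2,3,1), and (3,-1,1) of the example above, each encoded as a list of I values.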

In the case of the sortedness predicate discussed in Section 1, $T_1 = Z^*$ with
$k = 1$, the length $|\tilde{r}|$ of the auxiliary parameters of $g$ is 2, and $\mathit{step} : Z \times B \times Z \to
B \times Z$ is given by $\mathit{step}(u, r_b, r_i) = (r_b \land (r_i \le u),\ u)$.

For the predicate avge mentioned in Section 1, $T_1 = Z^*$ and $T_2 = Z$ with
$k = 2$, and $\mathit{step} : Z \times Z \times B \times Z \to B \times Z$ is given by $\mathit{step}(u_1, u_2, r_b, r_i) =
(r_i + u_1 - u_2 \ge 0,\ r_i + u_1 - u_2)$. Here, during the computation of $\mathit{avge}(\ell, m)$, the
parameter $r_i$ accumulates the sum of $\ell_j - m$ (where $\ell_j$ is the $j$-th element of $\ell$).
Whether the average of the elements of $\ell$ is no less than $m$ can be determined
by checking whether the final value of $r_i$ is no less than 0.
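The avge predicate can likewise be written as a left fold. In this sketch the name step_avge is introduced for illustration, and the replication of the scalar argument m is done with List.map rather than a general ezip.

```ocaml
(* step for avge: u1 is the current list element, u2 the replicated bound m.
   The Boolean state is recomputed from the running sum at every step. *)
let step_avge (_rb, ri) (u1, u2) =
  let ri' = ri + u1 - u2 in
  (ri' >= 0, ri')

(* avge l m: the average of the elements of l is no less than m.
   The default state (true, 0) makes avge [] m true. *)
let avge l m =
  fst (List.fold_left step_avge (true, 0) (List.map (fun x -> (x, m)) l))
```

The comparison against the average is done without division: the sum of the differences between each element and m is non-negative exactly when the average is no less than m.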

Our synthesis problem subsumes the problem of learning automata (which is 
obtained as a special case, where the signature consists of a single predicate $p$:
$(B^*)^m$ and $\mathit{step} : B^{m+n} \to B^n$; input symbols and states are encoded as elements
of $B^m$ and $B^n$ respectively) and also that of symbolic automatic relations [19].
In fact, the automatic synthesis of symbolic automatic relations was one of the 
motivations behind the present paper, as explained below. 

The motivations for the synthesis problem above come from automated pro- 
gram verification and synthesis. For automated program verification, we have 
CHC-based program verification [1] in mind, where various program verification 
problems are reduced to the satisfiability problem for Constrained Horn Clauses 
(CHCs). For programs using lists, the CHCs obtained by the reduction involve 
predicates over lists, but the current CHC solvers [10,14,2] are not very good at 
solving such CHCs. A solver for the synthesis problem above can be used as an 
important component in a CHC solver [2,4] based on the ICE-learning frame- 
work [6], to synthesize a candidate solution for CHCs involving lists. Another 
application is the oracle-based programming mentioned in Section 1, whose goal 
is to synthesize code fragments to fill the holes of a given program pattern. 
By solving the synthesis problem above, we can automatically synthesize code 
Fig. 2. The feed-forward network inside the RNN


fragments that involve recursive computation over lists. The roles of implication 
constraints in those applications are explained in [2,13]. 

In both of the applications above, the validity of a synthesized program is 
determined based on the whole verification or synthesis goal (in the case of 
verification, a synthesized predicate over lists is valid if it is indeed a solution for 
the CHC satisfiability problem). Thus, in the actual applications, the synthesis 
problem defined above needs to be repeatedly solved with the set of sample data 
being gradually expanded, until the end goal of program verification or synthesis 
is achieved. 


3 The Design and Training of the RNN 


This section describes the design of our special-purpose recurrent neural network
(RNN) tailored for our synthesis problem, and how to train it. 


3.1 The Architecture of the RNN 


The overall structure of the RNN is as already depicted in Figure 1. The structure 
of the feed-forward (FF) network inside the RNN is shown in Figure 2. The 
network consists of four layers of nodes, where the first layer (the leftmost one) 
consists of input nodes of the FF network, which hold the input values and 
hidden state values of the whole RNN, and the fourth layer (the rightmost one) 
consists of output nodes of the FF network, which hold the next states of the 
RNN. The nodes of the diamond shape take values in the range [-1, 1] (either
by the assumption on inputs or by the use of tanh as the activation function),
and those of the circle shape take arbitrary floating point numbers. The value
of each diamond-shaped node is computed by $\tanh(b + w_1 x_1 + \cdots + w_k x_k)$ and
that of each circle node is computed by $b + w_1 x_1 + \cdots + w_k x_k$, where the bias $b$
and the weights $w_i$ vary for each node and link. Each $\otimes$ node in the fourth layer
has exactly two inputs $x$ and $y$, and outputs $\frac{1+x}{2} \cdot y$, where $x$ is the output of the
diamond-shaped node.
The part of the FF network to compute the diamond-shaped nodes in the
fourth layer is analogous to the network in the previous NEUGUS framework [12]
for the synthesis of logical formulas. Each diamond-shaped node in the second
layer, whose output is $\tanh(b + w_1 x_1 + \cdots + w_k x_k)$, is intended to recognize linear
inequalities of the form $c_0 + c_1 x_1 + \cdots + c_k x_k > d$ where $|d|$ is a small integer, and
$c_i / c_0 = w_i / b$. The idea is that the value of the node $\tanh(b + w_1 x_1 + \cdots + w_k x_k) =
\tanh((b / c_0) \cdot (c_0 + c_1 x_1 + \cdots + c_k x_k))$ is close to -1 or 1 when both $|b / c_0|$ and
$|c_0 + c_1 x_1 + \cdots + c_k x_k|$ are large, so that the node carries only information about
whether $c_0 + c_1 x_1 + \cdots + c_k x_k > d$ holds for each $d$ such that $|d|$ is small. The
diamond-shaped nodes in the third and fourth layers are intended to compute
the Boolean combinations of those linear inequalities and Boolean inputs/hidden
states.

The rest of the FF network, for computing the $\otimes$-nodes in the fourth layer,
is intended to compute conditional expressions of the form


if $b$ then $c_0 + c_1 x_1 + \cdots + c_k x_k$ else 0,


where $b$ is a logical combination of linear inequalities and Boolean inputs/hidden
states. Each circle node in the second layer computes the part $c_0 + c_1 x_1 + \cdots + c_k x_k$,
each node in the lower group of the third layer computes the Boolean value $b$, and
each $\otimes$-node emulates the conditional expression. The idea is that the Boolean
value $b$ is actually represented as a value in [-1, 1], where values close to -1 and 1
are respectively regarded as false and true. Thus, $\frac{1+b}{2} \cdot (c_0 + c_1 x_1 + \cdots + c_k x_k)$
is close to $c_0 + c_1 x_1 + \cdots + c_k x_k$ when $b$ represents true, and it is close to 0 when
$b$ represents false. Note that the general conditional if $b$ then $e_1$ else $e_2$ can
be expressed by (if $b$ then $e_1$ else 0) + (if $\neg b$ then $e_2$ else 0) $= \frac{1+b}{2} e_1 + \frac{1-b}{2} e_2$,
which can be computed in the next cycle if we have hidden states that correspond
to if $b$ then $e_1$ else 0 and if $\neg b$ then $e_2$ else 0.
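The gating arithmetic can be illustrated with a few lines of OCaml. The names gate and cond are hypothetical; the sketch only demonstrates the identity used above and does not model the two-cycle evaluation through hidden states.

```ocaml
(* x in [-1, 1] encodes a Boolean (-1 = false, 1 = true); y is any float.
   gate x y is close to y when x is close to 1, and close to 0 when x is
   close to -1. *)
let gate x y = (1.0 +. x) /. 2.0 *. y

(* "if b then e1 else e2" as a sum of two gated terms; negation of the
   encoded Boolean is a sign flip. *)
let cond b e1 e2 = gate b e1 +. gate (-. b) e2
```

With exact Boolean encodings the emulation is exact: cond 1.0 3.0 7.0 is 3.0 and cond (-1.0) 3.0 7.0 is 7.0. With a value such as 0.998 for b the result is only approximately e1, which is why the synthesis step later re-interprets the node as a proper conditional.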


Remark 1. As explained above, the internal structure of our RNN is specialized 
for the purpose of solving our synthesis problem, and quite different from other 
popular RNNs. The $\otimes$-node is reminiscent of a multiplicative gate of LSTM [9],
but its main role is to emulate a conditional expression, rather than to address the 
problems of conventional RNNs such as the gradient vanishing problem. In fact, 
we do not expect that our RNN scales for very long lists. Fortunately, however, 
training data with short lists would often suffice for our synthesis problem. 


3.2 Training the RNN 


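Let $R$ be the set of real numbers and $g \in [-1,1]^{h+m} \times R^{n+k} \to [-1,1]^h \times R^k$ be
the function computed by the FF network. The function $f \in ([-1,1]^m \times R^n)^* \to
[-1,1]$ computed by the whole RNN is defined by $f(\ell) = f'(\tilde{1}, \ell, \tilde{0})$, where:


\[ f'(b_1, \ldots, b_h, [\,], \tilde{z}) = b_1 \qquad f'(\tilde{b}, (\tilde{x}, \tilde{y}) :: \ell, \tilde{z}) = f'(\tilde{b}', \ell, \tilde{z}') \ \text{ where } (\tilde{b}', \tilde{z}') = g(\tilde{b}, \tilde{x}, \tilde{y}, \tilde{z}). \]


Here, $f' \in [-1,1]^h \times ([-1,1]^m \times R^n)^* \times R^k \to [-1,1]$.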
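The unrolling of the RNN over an input list can be sketched as follows. This is a minimal illustration assuming h >= 1; the function toy_g is a hand-written stand-in for a trained FF network (it is not learned and exists only to exercise run_rnn), and both names are introduced here.

```ocaml
(* g is the FF network: it maps (Boolean states, Boolean inputs, numeric
   inputs, numeric states) to the next (Boolean states, numeric states).
   Boolean states start at 1.0 (true) and numeric states at 0.0. *)
let run_rnn g ~h ~k inputs =
  let rec f' bs inputs zs =
    match inputs with
    | [] -> List.hd bs                       (* output = first Boolean state *)
    | (xs, ys) :: rest ->
        let bs', zs' = g bs xs ys zs in
        f' bs' rest zs'
  in
  f' (List.init h (fun _ -> 1.0)) inputs (List.init k (fun _ -> 0.0))

(* Hypothetical stand-in for a trained network with h = n = k = 1, m = 0:
   it tracks sortedness exactly, using 1.0 / -1.0 for true / false. *)
let toy_g bs _xs ys zs =
  match bs, ys, zs with
  | [b], [u], [s] ->
      let ok = if s <= u then 1.0 else -1.0 in
      ([min b ok], [u])
  | _ -> invalid_arg "toy_g: expected h = n = k = 1"
```

Running run_rnn toy_g ~h:1 ~k:1 on the encoding of [2;3;5] returns 1.0, and on the encoding of [3;1] it returns -1.0.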




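For an atom $p(\tilde{v}, \tilde{w})$ with $\tilde{v} \in B^m$ and $\tilde{w} \in Z^n$, we write $\theta_{p(\tilde{v}, \tilde{w})}$ for $f(\tilde{v}^\dagger, \tilde{w})$,
where $\mathit{true}^\dagger = 1$ and $\mathit{false}^\dagger = -1$. For an implication constraint $a_1 \land \cdots \land a_k \Rightarrow
b_1 \lor \cdots \lor b_\ell$, we define the loss $\mathit{loss}_{a_1 \land \cdots \land a_k \Rightarrow b_1 \lor \cdots \lor b_\ell}$ for the implication constraint
by:³


\[ \mathit{loss}_{a_1 \land \cdots \land a_k \Rightarrow b_1 \lor \cdots \lor b_\ell} = \prod_{i \in \{1, \ldots, k\}} \frac{1 + \theta_{a_i}}{2} \cdot \prod_{j \in \{1, \ldots, \ell\}} \frac{1 - \theta_{b_j}}{2}. \]


Note that $\mathit{loss}_{a_1 \land \cdots \land a_k \Rightarrow b_1 \lor \cdots \lor b_\ell}$ is 0 just if one of the $a_i$'s is false or one of
the $b_j$'s is true, which matches the meaning of the implication constraint. For
a set $C = \{\gamma_1, \ldots, \gamma_p\}$ of implication constraints, the overall loss is defined by:
$\mathit{loss}_C := \sum_{i \in \{1, \ldots, p\}} \mathit{loss}_{\gamma_i}$. Using the loss function above, we train the RNN
with a gradient descent method.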
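As an illustration, the sketch below implements one reading of this loss: a product of factors (1 + θ_a)/2 over the body atoms and (1 − θ_b)/2 over the head atoms, summed over all constraints. That product form is an assumption made for this sketch; it has the property stated above (the loss is 0 exactly when some body atom evaluates to -1 or some head atom to 1), but the exact shape of the original formula is not certain. The names constraint_loss and total_loss are hypothetical.

```ocaml
(* theta maps an atom to the RNN output in [-1, 1] for that atom.
   body = a_1 .. a_k, head = b_1 .. b_l. *)
let constraint_loss theta body head =
  let prod f l = List.fold_left (fun acc a -> acc *. f a) 1.0 l in
  prod (fun a -> (1.0 +. theta a) /. 2.0) body
  *. prod (fun b -> (1.0 -. theta b) /. 2.0) head

(* Overall loss: the sum of the losses of all constraints. *)
let total_loss theta constraints =
  List.fold_left
    (fun acc (body, head) -> acc +. constraint_loss theta body head)
    0.0 constraints
```

In an actual training loop the same expression would be built from differentiable tensor operations so that gradient descent can be applied; plain floats are used here only to show the arithmetic.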


Adjusting the loss function. The diamond-shaped nodes in Figure 2 are intended
to hold Boolean values (which correspond to 1 and -1), but those nodes in the
actual RNN trained by using the above loss function may take values close to
0, which cannot be interpreted as true or false. That is problematic during
the program synthesis, because the behavior of the RNN may deviate too much
from that of an ordinary program to be synthesized. To remedy the problem,
we also use a modified version of the loss function, obtained by replacing $\theta_a$ in
the basic loss function above with $\theta'_a := \theta_a \cdot \prod_i |v_i|^\lambda$ where $\lambda > 0$ (note that
the modified loss function coincides with the basic loss function when $\lambda = 0$),
and $v_i$ is the value of a diamond-shaped node in the second or fourth layer of
the FF network in Figure 2. This penalizes the use of “non-Boolean values” in
diamond-shaped nodes. Note that if $v_i$ cannot be interpreted as true or false,
i.e., if $|v_i|$ is close to 0, then $|v_i|^\lambda$ is much smaller than 1; thus, $|\theta'_a|$ would also
be much smaller than 1, causing a large loss.


4 Synthesis Based on the Trained RNN 


This section discusses how to construct the function step in Section 2, by using 
the trained RNN as a hint. From the trained RNN and its runs for training data, 
we gather and use the following information. 


— The weight and bias of each link and node in the FF network. 
— A collection of the inputs given to the FF network and the corresponding 
outputs of each node. 


The output of the function step consists of Booleans and integers. We first 
discuss how to construct the integer part. The integer part of step corresponds 


3 This loss function is different from the one used in [12]. The difference is partly 
due to the encoding of Boolean values; Kobayashi et al. [12] used 0 for false while 
we use -1. Another difference is the use of log vs squared loss. We preferred the
latter for simplicity, but more experiments are necessary to tune the shape of the 
loss function. 
to the $\otimes$-nodes of the FF-network in Figure 2, whose values are computed by a
function of the form:


\[ I(r_1, \ldots, r_h, v_1, \ldots, v_m, u_1, \ldots, u_n, s_1, \ldots, s_k) = B(r_1, \ldots, r_h, v_1, \ldots, v_m, u_1, \ldots, u_n, s_1, \ldots, s_k) \otimes \Bigl(b_0 + \sum_{i \in \{1, \ldots, n\}} w_i u_i + \sum_{j \in \{1, \ldots, k\}} w'_j s_j\Bigr), \]


where $\tilde{r}$, $\tilde{v}$, $\tilde{u}$, and $\tilde{s}$ respectively represent the hidden Boolean states, Boolean
inputs, integer inputs, and hidden integer states; the function $B$ is the output
of a node in the lower half of the third layer in Figure 2; the part $b_0 + \cdots$ is the
output of a circle node in the second layer; and $x \otimes y = \frac{1+x}{2} \cdot y$ as defined before.

Since the value of $I$ is $b_0 + \sum_{i \in \{1, \ldots, n\}} w_i u_i + \sum_{j \in \{1, \ldots, k\}} w'_j s_j$ if the value
of $B$ is 1, and 0 if the value of $B$ is -1, one may be tempted to construct the
corresponding program expression as:


if $\varphi_B$ then $b_0 + \sum_{i \in \{1, \ldots, n\}} w_i u_i + \sum_{j \in \{1, \ldots, k\}} w'_j s_j$ else 0,


where $\varphi_B$ is a Boolean expression corresponding to $B$. That is problematic,
however, because we wish to construct an integer program expression, but the
weights and bias ($w_i$, $w'_j$, $b_0$) may be arbitrary floating point numbers. We thus re-scale
the coefficients $w_i$, $w'_j$, and $b_0$ as follows. We first pick integers $c_0, c_1, \ldots, c_n$
and a real number $r$ so that $r b_0, r w_1, \ldots, r w_n$ are close to $c_0, c_1, \ldots, c_n$. For $w'_j$,
we just pick an integer $c'_j$ close to $w'_j$, and prepare the integer expression:


if $\varphi_B$ then $c_0 + \sum_{i \in \{1, \ldots, n\}} c_i u_i + \sum_{j \in \{1, \ldots, k\}} c'_j s_j$ else 0,


and use it as the integer part of the function step.
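The text does not say how the integers and the scaling factor r are chosen. One possible heuristic, shown purely as an assumption, is to try a few small integer targets for the largest-magnitude coefficient and keep the scaling with the smallest relative rounding error. The function rescale and its parameter max_d are hypothetical; the sketch covers only the bias and input weights, not the separately rounded hidden-state weights.

```ocaml
(* Given [b0; w1; ...; wn], return (r, [c0; c1; ...; cn]) such that each
   r * w is close to the corresponding integer c. *)
let rescale ?(max_d = 5) (ws : float list) : float * int list =
  let wmax = List.fold_left (fun m w -> max m (abs_float w)) 0.0 ws in
  if wmax = 0.0 then (1.0, List.map (fun _ -> 0) ws)
  else begin
    (* Scale so that the largest coefficient becomes d, then round. *)
    let candidate d =
      let r = float_of_int d /. wmax in
      let cs = List.map (fun w -> int_of_float (Float.round (r *. w))) ws in
      let err =
        List.fold_left2
          (fun e w c -> e +. abs_float ((r *. w) -. float_of_int c))
          0.0 ws cs
      in
      (err /. float_of_int d, r, cs)
    in
    (* The small margin keeps the smaller multiplier when errors tie. *)
    let pick (e0, r0, c0) d =
      let (e, r, c) = candidate d in
      if e < e0 -. 1e-9 then (e, r, c) else (e0, r0, c0)
    in
    let ds = List.init (max_d - 1) (fun i -> i + 2) in
    let (_, r, cs) = List.fold_left pick (candidate 1) ds in
    (r, cs)
  end
```

For the bias -0.023 and weight 1.128 that appear in Example 1 below, this heuristic returns the integers [0; 1] with r = 1/1.128, matching the ratio 0 : 1 used there. Other coefficient vectors may be rounded differently from the choices made in the paper.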

Before constructing Boolean expressions (including yp), we adjust (i) the 
hidden integer states in the run history of RNNs for training data and (ii) the 
weights for the hidden integer nodes accordingly, to reflect the re-scaling of 
the coefficients for computing hidden integer states. We multiply (i) with r, and 
divide (ii) by r. To see the need for the adjustment, let us recall the step function 
for the sortedness: 


\[ \mathit{step}(u, r_b, r_i) = (r_b \land (r_i \le u),\ u). \]
The RNN may actually learn the following function:
\[ \mathit{step}(u, r_b, r_i) = (r_b \land (2 r_i \le u),\ 0.5u). \]


Suppose we have re-scaled $0.5u$ to $u$, to make the coefficient an integer. That
would increase the value of the hidden integer state by a factor of 2, so that the
coefficient of $r_i$ in the inequality $2 r_i \le u$ should be decreased by half, to obtain
$r_i \le u$. We can thus obtain


\[ \mathit{step}(u, r_b, r_i) = (r_b \land (r_i \le u),\ u) \]


correctly. 
Table 1. The value of each node of the FF-network for [2; 3; 5].


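            Before re-scaling                   ||             After re-scaling
1st layer | 2nd layer | 3rd layer | 4th layer || 1st layer | 2nd layer | 3rd layer | 4th layer
----------+-----------+-----------+-----------++-----------+-----------+-----------+----------
1.000     |  0.996    |  0.936    | 0.999     || 1.000     |  0.996    |  0.936    | 0.999
2.000     |  0.997    |  0.994    |           || 2.000     |  0.997    |  0.994    |
0.000     | -0.999    |  0.975    | 2.231     || 0.000     | -0.999    |  0.975    | 1.978
          |  0.992    | -0.969    |           ||           |  0.992    | -0.969    |
          |  2.235    |  0.998    |           ||           |  1.982    |  0.998    |
----------+-----------+-----------+-----------++-----------+-----------+-----------+----------
0.999     |  1.000    |  0.936    | 0.999     || 0.999     |  1.000    |  0.936    | 0.999
3.000     |  0.977    |  0.992    |           || 3.000     |  0.977    |  0.992    |
2.231     | -1.000    |  0.967    | 3.256     || 1.978     | -1.000    |  0.967    | 2.888
          |  0.924    | -0.969    |           ||           |  0.924    | -0.969    |
          |  3.262    |  0.998    |           ||           |  2.893    |  0.998    |
----------+-----------+-----------+-----------++-----------+-----------+-----------+----------
0.999     |  1.000    |  0.936    | 0.999     || 0.999     |  1.000    |  0.936    | 0.999
5.000     |  0.998    |  0.994    |           || 5.000     |  0.998    |  0.994    |
3.256     | -1.000    |  0.975    | 5.463     || 2.888     | -1.000    |  0.975    | 4.844
          |  0.995    | -0.969    |           ||           |  0.995    | -0.969    |
          |  5.472    |  0.998    |           ||           |  4.853    |  0.998    |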


Example 1. As a concrete example, consider the synthesis of a sortedness predicate
sorted, which takes a list $\ell$ and returns whether $\ell$ is sorted in ascending
order. We set h = n = k = 1, and m = 0. The numbers of hidden nodes in
the upper half of the second layer and in the upper half of the third layer
were both set to 4. We have trained the network by using 200 positive samples
(like ⇒ sorted([2;3;5])) and 94 negative samples (like sorted([9;8]) ⇒). After
the training, we re-ran the RNN for the training data, and collected the value
of each node of the FF-network. For example, for the data [2;3;5], we obtained
the information shown on the left-hand side of Table 1. Here, the first group
(separated by the horizontal line) shows the values of the nodes for the first
element 2 of the list, and the second group shows those for the second element
3. We also look at the weights and biases of the FF-network to synthesize the
target function step.


By inspecting the weights and bias for the circle node in the second layer,
we can find that the function computed by the node is $-0.023 + 1.128u - 0.045s$,
where $u$ and $s$ respectively denote the values of the integer input and the hidden
integer state. The ratio between the constant and the coefficient of $u$ is about
0 : 1, and the coefficient of $s$ is close to 0. Thus, we set the integer expression
to compute the next hidden integer state to if $\varphi_B$ then $u$ else 0, where the
condition $\varphi_B$ is yet to be synthesized.


The replacement of $-0.023 + 1.128u - 0.045s$ with $u$ scales the value of the
hidden integer state by a factor of 1/1.128, as shown on the
right-hand side of Table 1. The weights for the nodes in the second layer are also 
accordingly re-scaled. 
It remains to construct Boolean expressions, consisting of linear inequalities
on integer variables and Boolean variables. That can be achieved in a manner 
similar to [12]; we have, however, adopted the following procedure, which uti- 
lizes information about the value of each node in the FF network. In contrast, 
Kobayashi et al.’s method [12] uses only the weights and biases, in addition to 
the input and output for each training data; they did not utilize the values of 
internal nodes for each training data. 

We synthesize linear inequalities corresponding to the diamond-shaped nodes 
in the second layer as follows. Let 


\[ \tanh(b_0 + w_1 u_1 + \cdots + w_n u_n + w_{n+1} s_1 + \cdots + w_{n+k} s_k) \]


be the value computed by a diamond-shaped node in the second layer (where
we assume that the weights $w_{n+1}, \ldots, w_{n+k}$ have already been re-scaled). Let
$c_0, c_1, \ldots, c_{n+k}$ be integers whose ratios are close to those of $b_0, w_1, \ldots, w_{n+k}$.
Then we set the corresponding inequality to


\[ c_0 + c_1 u_1 + \cdots + c_n u_n + c_{n+1} s_1 + \cdots + c_{n+k} s_k > e, \]


where $e \in \{-1, 0, 1\}$ is chosen so that the truth value of the inequality best
matches the actual input-output behavior of the node for training data; recall
the discussion in Section 3.1. 
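Choosing the offset e can be sketched as a small search over the three candidates. The sketch assumes that the trace is given as pairs of integer inputs (u_1, ..., u_n, s_1, ..., s_k) and the observed node value, that a positive node value is read as true, and that ties are resolved in favor of the smaller offset; the name choose_offset and these conventions are assumptions of the illustration.

```ocaml
(* cs = [c0; c1; ...; c_{n+k}]; each trace entry pairs the integer inputs
   with the tanh output of the node for those inputs. *)
let choose_offset (cs : int list) (trace : (int list * float) list) : int =
  let c0, rest =
    match cs with
    | c0 :: rest -> (c0, rest)
    | [] -> invalid_arg "choose_offset: empty coefficient list"
  in
  (* Left-hand side c0 + c1*x1 + ... of the candidate inequality. *)
  let lhs xs = List.fold_left2 (fun acc c x -> acc + c * x) c0 rest xs in
  (* Number of trace entries on which "lhs > e" agrees with the node. *)
  let score e =
    List.length (List.filter (fun (xs, v) -> (lhs xs > e) = (v > 0.0)) trace)
  in
  List.fold_left
    (fun best e -> if score e > score best then e else best)
    (-1) [0; 1]
```

For example, with coefficients [0; 1; -1] (that is, u - s) and a trace in which the node is positive exactly when u >= s, the search returns -1, giving the inequality u - s > -1.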

Next, we construct Boolean functions corresponding to the diamond-shaped
nodes in the fourth layer and the lower half of the third layer in Figure 2. This is
performed by first constructing the truth tables for those functions based on the
runs of the RNN for the training data, and then using a method for Boolean decision
tree construction [7],⁴ where Boolean variables and the inequalities synthesized
above are used as qualifiers (i.e., atomic predicates that constitute Boolean
functions). Those qualifiers are prioritized based on the weights for the nodes in
the third and fourth layers. The synthesized functions may not completely match
the truth tables if appropriate inequalities have not been found in the previous
step. Even so, we proceed to the next step to construct the step function and test
it; recall that in our gray-box use of the neural network, the internal behavior of
the synthesized program need not completely match that of the RNN.


Example 2. Recall Example 1. The next step is to synthesize linear inequalities 
from the (re-scaled) weights of the nodes in the second layer. After the re-scaling 
of weights, the functions computed by the diamond-shaped nodes are: 


tanh(1.396 + 0.876u + 1.182s) tanh(1.066 + 1.084u — 1.052s) 


Based on the ratios between the constant and coefficients, we synthesize linear 
inequalities of the form: 


$$4 + 3u + 4s > e_1, \qquad 1 + u - s > e_2, \qquad -6 - 4u - 3s > e_3, \qquad u - s > e_4.$$


⁴ Kobayashi et al. [12] suggested using the Quine-McCluskey method for this purpose,
but we prefer the Boolean decision tree construction for two reasons. First, the
Quine-McCluskey method would not scale when the dimension is large. Second, we
wish to give priorities to some qualifiers as explained below.
We then check the re-scaled trace information (such as the one in Table 1, but
including the trace information for all the training data) and choose appropriate
values for each $e_i$. In the present case, we obtain:


$$4 + 3u + 4s > 0, \qquad 1 + u - s > -1, \qquad 7 - 3u - 4s > 0, \qquad u - s > -1.$$


It remains to synthesize Boolean functions. To this end, for each diamond- 
shaped node in the fourth layer and in the lower-half of the third layer, we 
construct a truth table, where the inputs are Boolean values obtained by dis- 
cretizing the values of the diamond-shaped nodes in the first and second layers. 
For example, from Table 2, we obtain the following truth table for the diamond- 
shaped node in the fourth layer. The duplicated rows can be removed before the 
synthesis of a logical function. 


input                              | output
I0   | I1   | I2   | I3    | I4   | O
true | true | true | false | true | true
true | true | true | false | true | true


Here, $I_0$ corresponds to the value of the hidden Boolean node, and $I_1, \ldots, I_4$
correspond to the diamond-shaped nodes in the second layer, which represent the
inequality constraints extracted above. We interpret values close to 1 (say, those
greater than 0.5) as true, and those close to $-1$ (say, those less than $-0.5$) as
false, ignoring the other values.

Once a truth table has been constructed, we can apply a classical method 
to synthesize a logical function that conforms to the truth table. In our imple- 
mentation, we have employed a technique for Boolean decision tree construction; 
instead of computing the entropy [7], however, we have prioritized Boolean
inputs ($I_0, \ldots, I_4$, in the above case) based on the weights for the nodes in the third
and fourth layer, which indicate which Boolean inputs affected the output node. 
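
A decision tree built with a fixed priority order on the inputs, rather than an entropy measure, can be sketched as follows. The priority list, the majority-vote rule at impure leaves, and the sample truth table are assumptions made for illustration; in the tool the priorities come from the network weights.

```python
from itertools import product

def build_tree(rows, priority):
    """rows: list of (inputs, output); priority: input indices, most important first.

    Returns ("leaf", bool) or ("node", index, subtree_if_false, subtree_if_true).
    """
    outputs = {out for _, out in rows}
    if len(outputs) <= 1 or not priority:
        # Pure node, or no qualifier left: majority vote (ties count as true).
        trues = sum(1 for _, out in rows if out)
        return ("leaf", 2 * trues >= len(rows))
    index, rest = priority[0], priority[1:]
    low = [row for row in rows if not row[0][index]]
    high = [row for row in rows if row[0][index]]
    if not low or not high:          # this qualifier does not split the rows
        return build_tree(rows, rest)
    return ("node", index, build_tree(low, rest), build_tree(high, rest))

def evaluate(tree, inputs):
    while tree[0] == "node":
        _, index, low, high = tree
        tree = high if inputs[index] else low
    return tree[1]

# Hypothetical truth table over (I0, ..., I4); the duplicated first row is dropped.
table = [
    ((True, True, True, False, True), True),
    ((True, True, True, False, True), True),
    ((True, True, True, False, False), False),
    ((False, True, True, False, True), False),
    ((False, False, True, False, False), False),
]
table = list(dict.fromkeys(table))
tree = build_tree(table, priority=[0, 4, 1, 2, 3])

# With this priority order the tree computes I0 and I4 on every input.
matches_conjunction = all(
    evaluate(tree, bits) == (bits[0] and bits[4])
    for bits in product([False, True], repeat=5)
)
```

Because the order is fixed in advance, inputs with low priority are consulted only when the higher-priority ones fail to separate the rows.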

Suppose that the logical function $O = I_0 \wedge I_4$ has been synthesized in the
above example. Suppose also that the constant function true has been
synthesized for the diamond-shaped node in the third layer. Since $I_0$ corresponds to the
hidden Boolean state, and $I_4$ corresponds to the inequality $u - s > -1$ (which,
over the integers, is equivalent to $s \le u$), we obtain


$$\mathit{step}(u, r, s) = (r \wedge (s \le u),\ \text{if true then } u \text{ else } 0)$$


as the step function. 
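
To see how such a step function yields a recursive predicate, it can be folded over a list. The sketch below assumes a left-to-right traversal and an initial state of (true, minus infinity); both are choices made for illustration only, since the traversal order and initial state are fixed elsewhere in the framework.

```python
from functools import reduce

def step(u, r, s):
    """Synthesized step function: u is the current element, (r, s) the hidden state."""
    return (r and s <= u, u if True else 0)

def holds(xs, r0=True, s0=float("-inf")):
    """Fold step over xs and return the final Boolean state (assumed initial state)."""
    final_r, _ = reduce(lambda state, u: step(u, state[0], state[1]), xs, (r0, s0))
    return final_r
```

Under these assumptions `holds` returns true exactly for lists in non-decreasing order, e.g. `holds([1, 2, 2, 5])` is true and `holds([3, 1])` is false.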


By combining the procedures above, we can construct the function step. After 
constructing the step function, we test the synthesized recursive function against 
training data, and check whether the outputs of the synthesized function satisfy 
all the implication constraints. If some constraints are not satisfied, we re-train 
the RNN and repeat the synthesis procedure above. To avoid the re-training of 
the RNN from scratch, however, we first fix the part for computing the hidden 
integer states. This is because the process of re-scaling the parameters for the 
hidden integer states as explained above is costly and error-prone. Upon repeated
failures, however, we reset all the parameters of the RNN and re-train it from 
scratch. 


5 Implementation and Experiments 


We have implemented a tool called NEUGUSR for the synthesis of recursive pred- 
icates based on the method described above in OCaml using the machine learn- 
ing framework ocaml-torch (https://github.com/LaurentMazare/ocaml-torch), 
which is an OCaml interface for the PyTorch library. Our tool is available at 
https://github.com/naokikob/neugusR. This section describes the experiments 
we conducted that confirm the effectiveness of our approaches. 


All the experiments below were conducted on a laptop computer with In- 
tel(R) Core(TM) i5-8265U CPU (1.60GHz) and 8 GB memory. Training was 
done using only CPU. 


5.1 Dataset and predicates 


We have prepared 11 recursive predicates over integer lists and integers for
synthesis. Examples include predicates such as max(l, n) which says the largest
element of l is n, sumle(l1, l2) which says the sum of l1 is less than or equal
to the sum of l2, and predicates sorted(l) and avge(l, n) as already described
in Section 1.


For experiments, we consider positive constraints (of the form $\Rightarrow a_\kappa$, where
$a_\kappa \in \mathit{Atoms}$, and $\kappa$ is the corresponding signature of the predicate), negative
constraints (of the form $b_\kappa \Rightarrow$), as well as general implication constraints as
defined in Section 2.


For each problem (predicate), we performed 3 runs to see if the solver was 
able to synthesize a program that matches all the training examples. We set 
the time limit of each run to 1200 seconds. In each run, the neural network is 
trained for 30000 steps by default. At each step, all the training examples of the 
predicate were used to optimize the neural network. In each run, the training 
was terminated early if the accuracy reached 100% on the training examples and 
the loss was less than a threshold, which in the current setting is $10^{-6}$.


If the accuracy did not reach 100% within 30000 steps, or if some constraints
were not satisfied by the synthesized program, training was restarted with
fresh parameters except for the weights of the hidden integer states. After
three consecutive failures of convergence, however, we reset all the parameters
and restarted training from scratch.
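
The restart policy can be summarised by the following control-loop sketch. The callbacks `train`, `extract`, and `check` are hypothetical placeholders, the time limit is omitted, and every failed attempt (not only a failure to converge) is counted towards the full reset, which is a simplification.

```python
def synthesize(train, extract, check, max_retries=10, full_reset_after=3):
    """Retry loop; returns (program, retries_used) or (None, max_retries).

    train(keep_integer_part) -> model, or None if training did not converge
    extract(model)           -> candidate program
    check(program)           -> True iff all constraints are satisfied
    """
    consecutive_failures = 0
    keep_integer_part = False            # nothing to keep on the first attempt
    for attempt in range(max_retries + 1):
        model = train(keep_integer_part)
        if model is not None:
            program = extract(model)
            if check(program):
                return program, attempt
        consecutive_failures += 1
        if consecutive_failures >= full_reset_after:
            keep_integer_part = False    # reset all parameters
            consecutive_failures = 0
        else:
            keep_integer_part = True     # keep the hidden-integer-state weights
    return None, max_retries

# Toy run: the fifth attempt is the first one that passes the check.
flags = []

def toy_train(keep_integer_part):
    flags.append(keep_integer_part)
    return len(flags)                    # stand-in for a trained model

result = synthesize(toy_train, extract=lambda m: m, check=lambda p: p == 5)
```

In the toy run the third failure triggers a full reset, so the fourth attempt starts again without any kept weights.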

We used the Adam optimizer [11] for training with the default setting of 
ocaml-torch ($\beta_1 = 0.9$, $\beta_2 = 0.999$, without weight decay), and the learning rate
was 0.001. Learned parameters are not shared between different problems. 
5.2 Evaluation


The specification of RNN used for each problem is as follows. For all the pred- 
icates other than updown and max, we used 4 nodes for the second layer of the 
RNN, 8 nodes for the third layer of the RNN, and 1 node each for the integer 
hidden state and the boolean hidden state. For updown and max, we used 2 nodes 
for the boolean hidden states and 16 nodes for the third layer. For max, we also 
used 8 nodes for the second layer instead of 4. 

We report the performance of our tool NEUGUSR with respect to the fol- 
lowing metrics. 


— retry: the total number of retries. For each run, up to 10 retries were 
allowed within the time limit. There can be 3 × 10 retries for each problem
in total in the worst case. 

— success: the number of runs in which a program that correctly classifies 
the positive and negative examples was constructed. 

— time: the average execution time per run. The execution time includes the 
whole process for training the RNN and synthesizing/testing a program, 
though it was dominated by the time for training the RNN. 


Table 2 shows the performance of NEUGUSR for each predicate. It can be seen 
that NEUGUSR was able to solve all the problems consistently, with the only 
exception of max which failed once due to a timeout. The small number of retries 
triggered during the synthesis of each predicate suggests that our approach is 
effective. Our RNN was able to classify the positive and negative examples very 
well, because otherwise multiple restarts of training would have been forced 
even before entering the extraction phase. Our extraction procedure was also 
reasonably accurate — while errors could occur, they were quickly fixed within 
a few retries (3 on average as can be seen in Table 2). 


Table 2. Performance on the predicates to be synthesized. 


Predicate      | # retry | # success (out of 3) | time (s)
sorted(l)      | 0       | 3                    | 171.2
sortedrev(l)   | 1       | 3                    | 217.5
stairge(l)     | 5       | 3                    | 560.6
allge(l, n)    | 1       | 3                    | 272.3
allle(l, n)    | 1       | 3                    | 355.1
somege(l, n)   | 1       | 3                    | 376.2
avge(l, n)     | 8       | 3                    | 571.2
listle(l1, l2) | 0       | 3                    | 214.4
sumle(l1, l2)  | 2       | 3                    | 241.4
updown(l)      | 1       | 3                    | 226.1
max(l, n)      | 7       | 2                    | 557.6


The predicate max is the only predicate that involves equality among the 
11 predicates, which probably explains why it is the most difficult one. The 
fact that max can be synthesized was more of a surprise which demonstrated
the generality of our approach to some extent. While our framework was not 
designed specifically to handle equalities, the neural network, if lucky, might 
still be able to find clever ways to express equalities using inequalities. This is 
one of the reasons we specified 8 nodes for the second layer when dealing with 
max — the more inequalities we have, the more likely a combination of them 
happens to express certain equality. 


Remark 2. We could not find any previous tool that can be directly compared 
with ours. A possible alternative approach to our synthesis problem would be to 
prepare a template for the step function, generate constraints on parameters in 
the template, and use an SMT solver to solve them. 


6 Related Work 


As already mentioned, the present work may be considered an extension of 
Kobayashi et al.’s NEUGUS framework [12], where feed-forward neural networks 
are used as gray-boxes to synthesize formulas of quantifier-free linear integer 
arithmetic. We have significantly expanded the scope of NEUGUS, by enabling 
the synthesis of recursive predicates on lists; to that end, we have employed 
special-purpose recursive neural networks. 

Our work has been partially motivated by Shimoda et al.’s work on an ex- 
tension of symbolic automata called symbolic automatic relations (SARs) [19]. 
They introduced SARs to express recursive predicates on lists, and used them 
to express loop invariants on lists (more precisely, to express candidate solu- 
tions for the CHC satisfiability problem [1]) for automated verification of list- 
manipulating programs. They left it to future work how to automatically infer 
SARs from positive, negative, and implication constraints. Our work fills that 
gap, since the class of programs synthesized in our framework corresponds to 
their SARs (more precisely, $\Sigma_1^{\mathrm{sar}}$-formulas [19]). Further refinement and
optimizations would be, however, required for our tool to be effectively used in that
context. 

Our work is also related to neural network-based approaches to the synthesis 
of finite automata [16,21]. Our method deals with a much wider class of programs 
involving integers and integer lists. Also, the problem setting is slightly different; 
Weiss et al.’s method [21] takes a trained RNN as the ground truth, and aims to 
construct an automaton whose behavior matches that of the RNN. In contrast, 
in our approach, we allow the behavior of the synthesized program and that of 
the RNN to be different for inputs other than those given as training data. This is 
because in the NEUGUS framework, the trained RNN is supposed to be used just 
as a hint, and does not necessarily provide the ground truth. The ground truth 
is determined from the whole verification or synthesis goal [12,13], as discussed 
at the end of Section 2. In the context of program verification, the synthesized 
predicate is used as a candidate program invariant, and it is checked whether it 
is indeed an inductive invariant; if not, then new training data are added and 
NEUGUS should be repeated. In the context of oracle-based program synthesis,
the synthesized function is used as a component of the whole program, and then 
it is checked whether the whole program satisfies a specification; if not, then new 
training data for the function are generated and NEUGUS should be repeated. 
Recently, the above line of work has also been further extended to infer weighted 
automata [22,15] and context-free grammars [23], which are incompatible with 
the class of programs synthesized by our method. 

There have been studies of other approaches to program synthesis based 
on neural networks, most notably, those based on transformers [3,18,17]. Both 
the problem settings and approaches (the ways in which neural networks are 
used) are quite different between those studies and our work. Our goal is to syn- 
thesize programs from positive/negative/implication constraints (where those 
constraints are added as necessary in the whole loop of program verification or 
synthesis), and it is not clear to us how to effectively apply transformers-based 
approaches to program synthesis for that purpose. Whilst the transformers-based 
approaches can in principle be used for our program synthesis problem, huge 
training data (which consist of pairs of positive/negative/implication constraints 
and a program that satisfies the constraints) would be required and they might 
not work well for the synthesis of unseen programs. Other neural network-based 
approaches include that of AlphaTensor [5], which used deep reinforcement learn- 
ing to discover new matrix multiplication algorithms. 

The synthesis of predicates from positive/negative samples (but without im- 
plication constraints) is an instance of the well-studied problem of programming 
by examples (PBE). PBE has been successful especially in the synthesis of
string-to-string functions in DSL [8], and machine learning has also been recently
applied [20]. To our knowledge, however, the synthesis of recursive functions has
not been much studied in that context. 


7 Conclusion 


We have proposed a novel approach to automated synthesis of recursive predi- 
cates on lists, as an extension of Kobayashi et al.’s neural-network-guided synthe- 
sis (NEUGUS) [12]. We have designed a special-purpose recursive neural network 
and devised a method to synthesize a recursive predicate by using the trained 
network as a hint. We have implemented a synthesis tool based on the method 
and confirmed that the tool works reasonably well for various examples. We plan 
to further refine the tool and deploy it in the context of automated verification 
of list-manipulating programs [19] and oracle-based program synthesis [13]. We 
also plan to extend the method to enable the synthesis of a larger class of recur- 
sive programs, including more general list-processing programs that go beyond 
the “fold” functions, and tree-processing programs. 
Abstract. Complementation of nondeterministic Büchi automata (BAs) is an
important problem in automata theory with numerous applications in formal veri- 
fication, such as termination analysis of programs, model checking, or in decision 
procedures of some logics. We build on ideas from a recent work on BA
determinization by Li et al. and propose a new modular algorithm for BA
complementation. Our algorithm allows combining several BA complementation procedures
together, with one procedure for a subset of the BA’s strongly connected compo- 
nents (SCCs). In this way, one can exploit the structure of particular SCCs (such 
as when they are inherently weak or deterministic) and use more efficient special- 
ized algorithms, regardless of the structure of the whole BA. We give a general 
framework into which partial complementation procedures can be plugged in, and 
its instantiation with several algorithms. The framework can, in general, produce a 
complement with an Emerson-Lei acceptance condition, which can often be more 
compact. Using the algorithm, we were able to establish an exponentially better 
new upper bound of $O(4^n)$ for complementation of the recently introduced class
of elevator automata. We implemented the algorithm in a prototype and performed 
a comprehensive set of experiments on a large set of benchmarks, showing that 
our framework complements well the state of the art and that it can serve as a basis 
for future efficient BA complementation and inclusion checking algorithms. 


1 Introduction 


Nondeterministic Büchi automata (BAs) [8] are an elegant and conceptually simple
framework to model infinite behaviors of systems and the properties they are expected 
to satisfy. BAs are widely used in many important verification tasks, such as termination 
analysis of programs [30], model checking [54], or as the underlying formal model of 
decision procedures for some logics (such as S1S [8] or a fragment of the first-order 
logic over Sturmian words [31]). Many of these applications require performing
complementation of BAs: for instance, in termination analysis of programs within ULTIMATE
AUTOMIZER [30], complementation is used to keep track of the set of paths whose
termination still needs to be proved. On the other hand, in model checking⁵ and decision
procedures of logics, complement is usually used to implement negation and quantifier
alternation. Complementation is often the most difficult automata operation performed
here; its worst-case state complexity is $O((0.76n)^n)$ [48,2] (which is tight [55]).

⁵ Here, we consider model checking w.r.t. a specification given in some more expressive logic,
such as S1S [8], QPTL [50], or HyperLTL [12], rather than LTL [44], where negation is simple.

In these applications, efficiency of the complementation often determines the overall 
efficiency (or even feasibility) of the top-level application. For instance, the success of 
ULTIMATE AUTOMIZER in the Termination category of the International Competition 
on Software Verification (SV-COMP) [51] is to a large degree due to an efficient BA 
complementation algorithm [6,11] tailored for BAs with a special structure that it often 
encounters (as of the time of writing, it has won 6 gold medals in the years 2017–2022
and two silver medals in 2015 and 2016). The special structure in this case are the so- 
called semi-deterministic BAs (SDBAs), BAs consisting of two parts: (i) an initial part 
without accepting states/transitions and (ii) a deterministic part containing accepting 
states/transitions that cannot transition into the first part. 

Complementation of SDBAs using one from the family of the so-called NCSB
algorithms [6,5,11,28] has the worst-case complexity $O(4^n)$ (and usually also works much
better in practice than general BA complementation procedures). Similarly, there are
efficient complementation procedures for other subclasses of BAs, e.g., (i) deterministic
BAs (DBAs) can be complemented into BAs with $2n$ states [35] (or into co-Büchi
automata with $n + 1$ states) or (ii) inherently weak BAs (BAs where in each strongly
connected component (SCC), either all cycles are accepting or all cycles are rejecting) can be
complemented into DBAs with $O(3^n)$ states using the Miyano-Hayashi algorithm [42].

For a long time, there has been no efficient algorithm for complementation of BAs 
that are highly structured but do not fall into one of the categories above, e.g., BAs 
containing inherently weak, deterministic, and some nondeterministic SCCs. For such 
BAs, one needed to use a general complementation algorithm with the $O((0.76n)^n)$ (or
worse) complexity. To the best of our knowledge, only recently have works appeared
that exploit the structure of BAs to obtain a more efficient complementation algorithm:
(i) The work of Havlena et al. [29], who introduce the class of elevator automata (BAs
with an arbitrary mixture of inherently weak and deterministic SCCs) and give an $O(16^n)$
algorithm for them. (ii) The work of Li et al. [37], who propose a BA determinization 
procedure (into a deterministic Emerson-Lei automaton) that is based on decomposing 
the input BA into SCCs and using a different determinization procedure for different 
types of SCCs (inherently weak, deterministic, general) in a synchronous construction. 

In this paper, we propose a new BA complementation algorithm inspired by [37],
where we exploit the fact that complementation is, in a sense, more relaxed than
determinization. In particular, we present a framework where one can plug in different
partial complementation procedures fine-tuned for SCCs with a specific structure. The
procedures work only with the given SCCs, to some degree independently (thus reducing
the potential state space explosion) from the rest of the BA. Our top-level algorithm then
orchestrates runs of the different procedures in a synchronous manner (or completely
independently in the so-called postponed strategy), obtaining a resulting automaton with
potentially a more general acceptance condition (in general an Emerson-Lei condition),
which can help keep the result small. If the procedures satisfy given correctness
requirements, our framework guarantees that its instantiation will also be correct. We also
propose optimizations of the framework, e.g., using round-robin to decrease the amount of
nondeterminism, using a shared breakpoint to reduce the size and the number of colours for
a certain class of partial algorithms, and generalizing simulation-based pruning of macrostates.
We provide a detailed description of partial complementation procedures for
inherently weak, deterministic, and initial deterministic SCCs, which we use to obtain a new
exponentially better upper bound of $O(4^n)$ for the class of elevator automata (i.e., the
same upper bound as for its strict subclass of SDBAs). Furthermore, we also provide
two partial procedures for general SCCs based on determinization (from [37]) and the
rank-based construction. Using a prototype implementation, we then show that our algorithm
complements existing approaches well and significantly improves the state of the art.


2 Preliminaries 


We fix a finite non-empty alphabet $\Sigma$ and the first infinite ordinal $\omega$. An (infinite)
word $w$ is a function $w\colon \omega \to \Sigma$ where the $i$-th symbol is denoted as $w_i$. Sometimes,
we represent $w$ as an infinite sequence $w = w_0 w_1 \ldots$ We denote the set of all infinite
words over $\Sigma$ as $\Sigma^\omega$; an $\omega$-language is a subset of $\Sigma^\omega$.


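Emerson-Lei Acceptance Conditions. Given a set $\Gamma = \{0, \ldots, k-1\}$ of $k$ colours (often
depicted as ⓿, ❶, etc.), we define the set of Emerson-Lei acceptance conditions $\mathsf{EL}(\Gamma)$
as the set of formulae constructed according to the following grammar:


$$\alpha ::= \mathsf{Inf}(c) \mid \mathsf{Fin}(c) \mid (\alpha \wedge \alpha) \mid (\alpha \vee \alpha) \qquad (1)$$


for $c \in \Gamma$. The satisfaction relation $\models$ for a set of colours $M \subseteq \Gamma$ and condition $\alpha$ is
defined inductively as follows (for $c \in \Gamma$):


$M \models \mathsf{Fin}(c)$ iff $c \notin M$,    $M \models \alpha_1 \vee \alpha_2$ iff $M \models \alpha_1$ or $M \models \alpha_2$,
$M \models \mathsf{Inf}(c)$ iff $c \in M$,    $M \models \alpha_1 \wedge \alpha_2$ iff $M \models \alpha_1$ and $M \models \alpha_2$.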
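
The grammar and the satisfaction relation translate directly into a small recursive evaluator. The following is a minimal sketch; the class and function names are chosen here for illustration.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Inf:
    c: int

@dataclass(frozen=True)
class Fin:
    c: int

@dataclass(frozen=True)
class And:
    left: "Cond"
    right: "Cond"

@dataclass(frozen=True)
class Or:
    left: "Cond"
    right: "Cond"

Cond = Union[Inf, Fin, And, Or]

def satisfies(M, cond):
    """Return True iff the set of colours M satisfies the condition cond."""
    if isinstance(cond, Inf):
        return cond.c in M
    if isinstance(cond, Fin):
        return cond.c not in M
    if isinstance(cond, And):
        return satisfies(M, cond.left) and satisfies(M, cond.right)
    if isinstance(cond, Or):
        return satisfies(M, cond.left) or satisfies(M, cond.right)
    raise TypeError(f"not an Emerson-Lei condition: {cond!r}")

# Inf(0) and (Fin(1) or Inf(3)), evaluated on the colour set {0, 2}
example = satisfies({0, 2}, And(Inf(0), Or(Fin(1), Inf(3))))
```

Here M plays the role of the set of colours seen infinitely often along a run, so a Büchi condition is simply `Inf(0)` and a co-Büchi condition is `Fin(0)`.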


Emerson-Lei Automata. A (nondeterministic transition-based⁶) Emerson-Lei automaton
(TELA) over $\Sigma$ is a tuple $\mathcal{A} = (Q, \delta, I, \Gamma, \mathsf{p}, \mathsf{Acc})$, where $Q$ is a finite set of states,
$\delta \subseteq Q \times \Sigma \times Q$ is a set of transitions⁷, $I \subseteq Q$ is the set of initial states, $\Gamma$ is the set of
colours, $\mathsf{p}\colon \delta \to 2^\Gamma$ is a colouring function of transitions, and $\mathsf{Acc} \in \mathsf{EL}(\Gamma)$. We use
$p \xrightarrow{a} q$ to denote that $(p, a, q) \in \delta$ and sometimes also treat $\delta$ as a function
$\delta\colon Q \times \Sigma \to 2^Q$. Moreover, we extend $\delta$ to sets of states $P \subseteq Q$ as
$\delta(P, a) = \bigcup_{p \in P} \delta(p, a)$. We use $\mathcal{A}[q]$ for $q \in Q$ to denote the automaton
$\mathcal{A}[q] = (Q, \delta, \{q\}, \Gamma, \mathsf{p}, \mathsf{Acc})$, i.e., the TELA obtained from $\mathcal{A}$ by setting $q$ as the
only initial state. $\mathcal{A}$ is called deterministic if $|I| \le 1$ and $|\delta(q, a)| \le 1$ for each $q \in Q$
and $a \in \Sigma$. If $\Gamma = \{⓿\}$ and $\mathsf{Acc} = \mathsf{Inf}(⓿)$, we call $\mathcal{A}$ a Büchi automaton (BA) and
denote it as $\mathcal{A} = (Q, \delta, I, F)$ where $F$ is the set of all transitions coloured by ⓿, i.e.,
$F = \mathsf{p}^{-1}(\{⓿\})$. For a BA, we use $\delta_F(p, a) = \{q \in \delta(p, a) \mid \mathsf{p}(p \xrightarrow{a} q) = \{⓿\}\}$ (and
extend the notation to sets of states as for $\delta$). A BA $\mathcal{A} = (Q, \delta, I, F)$ is called
semi-deterministic (SDBA) if for every accepting transition $(p \xrightarrow{a} q) \in F$, the reachable
part of $\mathcal{A}[q]$ is deterministic.
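
As an illustration of these definitions, the sketch below stores a transition-based automaton with the transition function as a dictionary and implements the extension of δ to sets, the determinism test, and the accepting successors of a BA. The field and method names are chosen for the example, and the acceptance formula itself is left out.

```python
from dataclasses import dataclass, field

@dataclass
class TELA:
    states: set
    alphabet: set
    delta: dict                      # (state, symbol) -> set of successor states
    initial: set
    colours: dict = field(default_factory=dict)   # (p, a, q) -> frozenset of colours

    def post(self, P, a):
        """delta lifted to a set of states P."""
        result = set()
        for p in P:
            result |= self.delta.get((p, a), set())
        return result

    def is_deterministic(self):
        return len(self.initial) <= 1 and all(
            len(self.delta.get((q, a), ())) <= 1
            for q in self.states
            for a in self.alphabet
        )

    def accepting_post(self, p, a, colour=0):
        """delta_F(p, a) for a BA whose single colour is `colour`."""
        return {
            q
            for q in self.delta.get((p, a), set())
            if self.colours.get((p, a, q), frozenset()) == frozenset({colour})
        }

# A two-state BA over {a, b}; only the a-loop on state 1 is accepting.
ba = TELA(
    states={0, 1},
    alphabet={"a", "b"},
    delta={(0, "a"): {0, 1}, (0, "b"): {0}, (1, "a"): {1}},
    initial={0},
    colours={(1, "a", 1): frozenset({0})},
)
```

In this example `ba` is nondeterministic because state 0 has two a-successors, while the part reachable from state 1 is deterministic.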

A run of $\mathcal{A}$ from $q \in Q$ on an input word $w$ is an infinite sequence $\rho\colon \omega \to Q$ that
starts in $q$ and respects $\delta$, i.e., $\rho_0 = q$ and $\forall i \ge 0\colon \rho_i \xrightarrow{w_i} \rho_{i+1} \in \delta$. Let $\mathit{inf}_\delta(\rho) \subseteq \delta$
denote the set of transitions occurring in $\rho$ infinitely often and
$\mathit{inf}_\Gamma(\rho) = \bigcup \{\mathsf{p}(x) \mid x \in \mathit{inf}_\delta(\rho)\}$ be the set of infinitely often occurring colours. A run $\rho$
is accepting in $\mathcal{A}$ iff $\mathit{inf}_\Gamma(\rho) \models \mathsf{Acc}$ and the language of $\mathcal{A}$, denoted as $\mathcal{L}(\mathcal{A})$, is defined
as the set of words $w \in \Sigma^\omega$ for which there exists an accepting run in $\mathcal{A}$ starting with
some state in $I$.

⁶ We only consider transition-based acceptance in order to avoid cluttering the paper by
always dealing with accepting states and accepting transitions. Extending our approach to
state/transition-based (or just state-based) automata is straightforward.

⁷ Note that some authors use a more general definition of TELAs with $\delta \subseteq Q \times \Sigma \times 2^\Gamma \times Q$; we
only use them as the output of our algorithm, where the simpler definition suffices.

Consider a BA $\mathcal{A} = (Q, \delta, I, F)$. For a set of states $S \subseteq Q$ we use $\mathcal{A}_S$ to denote the
copy of $\mathcal{A}$ where accepting transitions only occur between states from $S$, i.e., the BA
$\mathcal{A}_S = (Q, \delta, I, F \cap \delta|_S)$ where $\delta|_S = \{p \xrightarrow{a} q \in \delta \mid p, q \in S\}$. We say that a non-empty
set of states $C \subseteq Q$ is a strongly connected component (SCC) if every pair of states
of $C$ can reach each other and $C$ is a maximal such set. An SCC of $\mathcal{A}$ is trivial if
it consists of a single state that does not contain a self-loop and non-trivial otherwise.
An SCC $C$ is accepting if it contains at least one accepting transition and inherently weak
iff either (i) every cycle in $C$ contains a transition from $F$ or (ii) no cycle in $C$ contains any
transitions from $F$. An SCC $C$ is deterministic iff the BA $(C, \delta|_C, \{q\}, \emptyset)$ for any $q \in C$ is
deterministic. We denote inherently weak components as IWCs, accepting deterministic
components that are not inherently weak as DACs (deterministic accepting), and the
remaining accepting components as NACs (nondeterministic accepting). A BA $\mathcal{A}$ is
called an elevator automaton if it contains no NAC.
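
The classification of SCCs can be sketched as follows. Transitions are given as tuples (p, a, q, accepting); this encoding, the quadratic reachability-based SCC computation, and the function names are choices made for brevity in this example. The sketch uses the observation that every cycle of an SCC contains an accepting transition exactly when no cycle remains after the accepting transitions are removed.

```python
from collections import defaultdict

def _succ(edges):
    succ = defaultdict(set)
    for p, _a, q, _acc in edges:
        succ[p].add(q)
    return succ

def _reach(succ, start):
    """States reachable from start in at least one step."""
    seen, stack = set(), [start]
    while stack:
        x = stack.pop()
        for y in succ[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen

def sccs(states, edges):
    succ = _succ(edges)
    reach = {q: _reach(succ, q) for q in states}
    comps = []
    for q in sorted(states):
        if not any(q in c for c in comps):
            comps.append(frozenset({q} | {p for p in reach[q] if q in reach[p]}))
    return comps

def classify(C, edges):
    """Return (kind, accepting) with kind in {"IWC", "DAC", "NAC"}."""
    inner = [e for e in edges if e[0] in C and e[2] in C]
    if not any(acc for *_, acc in inner):
        return "IWC", False                      # condition (ii): no accepting cycle
    rejecting = _succ([e for e in inner if not e[3]])
    if not any(q in _reach(rejecting, q) for q in C):
        return "IWC", True                       # condition (i): every cycle accepting
    targets = defaultdict(set)
    for p, a, q, _acc in inner:
        targets[(p, a)].add(q)
    deterministic = all(len(t) <= 1 for t in targets.values())
    return ("DAC" if deterministic else "NAC"), True

def is_elevator(states, edges):
    return all(classify(C, edges)[0] != "NAC" for C in sccs(states, edges))

edges = [
    (0, "a", 0, False), (0, "a", 1, False),      # {0}: non-accepting
    (1, "a", 1, True), (1, "b", 2, False),       # {1}: all cycles accepting
    (2, "a", 2, True), (2, "b", 2, False),       # {2}: mixed cycles, deterministic
]
kinds = {tuple(sorted(C)): classify(C, edges) for C in sccs({0, 1, 2}, edges)}
```

Note that determinism is checked only on transitions inside the component, so the nondeterministic choice that leaves state 0 does not affect the classification of the other components.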

We assume that $\mathcal{A}$ contains no accepting transition outside its SCCs (no run can
cycle over such transitions). We use $\delta_{SCC}$ to denote the restriction of $\delta$ to transitions that
do not leave their SCCs, formally, $\delta_{SCC} = \{p \xrightarrow{a} q \in \delta \mid p \text{ and } q \text{ are in the same SCC}\}$.
A partition block $P \subseteq Q$ of $\mathcal{A}$ is a nonempty union of its accepting SCCs, and a
partitioning of $\mathcal{A}$ is a sequence $P_1, \ldots, P_n$ of pairwise disjoint partition blocks of $\mathcal{A}$ that
contains all accepting SCCs of $\mathcal{A}$. Given a $P_i$, let $\mathcal{A}_{P_i}$ be the BA obtained from $\mathcal{A}$ by
removing colours from transitions outside $P_i$. The following fact serves as the basis of
our decomposition-based complementation procedure.


Fact 1. $\mathcal{L}(\mathcal{A}) = \mathcal{L}(\mathcal{A}_{P_1}) \cup \ldots \cup \mathcal{L}(\mathcal{A}_{P_n})$


The complement (automaton) of a BA $\mathcal{A}$ is a TELA that accepts the complement
language $\Sigma^\omega \setminus \mathcal{L}(\mathcal{A})$ of $\mathcal{L}(\mathcal{A})$. In the paper, we call a state and a run of a complement
automaton a macrostate and a macrorun, respectively.


3 A Modular Complementation Algorithm 


In a nutshell, the main idea of our BA complementation algorithm is to first decompose 
a BA A into several partition blocks according to their properties, and then perform 
complementation for each of the partition blocks (potentially using a different algorithm) 
independently, using either a synchronous construction, synchronizing the complemen- 
tation algorithms for all partition blocks in each step, or a postponed construction, which 
complements the partition blocks independently and combines the partial results using 
automata product construction. The decomposition of A into partition blocks can ei- 
ther be trivial—i.e., with one block for each accepting SCC—, or more elaborate, e.g., 
a partitioning where one partition block contains all accepting IWCs, another contains 
all DACs, and each NAC is given its own partition block. In this way, one can avoid 
running a general complementation algorithm for unrestricted BAs with the state
complexity upper bound $O((0.76n)^n)$ and, instead, apply the most suitable complementation
procedure for each of the partition blocks. This comes with three main advantages: 


1. The complementation algorithm for each partition block can be selected differently 
in order to exploit the properties of the block. For instance, for partition blocks 
with IWCs, one can use complementation based on the breakpoint (the so-called
Miyano-Hayashi) construction [42] with $O(3^n)$ macrostates (cf. Sec. 4.1), while for
partition blocks with only DACs, one can use an algorithm with the state
complexity $O(4^n)$ based on an adaptation of the NCSB construction [6,5,11,28] for SDBAs
(cf. Sec. 4.2). For NACs, one can choose between, e.g., rank- [34,21,48,10,24,29] 
or determinization-based [46,43,45] algorithms, depending on the properties of the 
NACs (cf. Sec. 6). 

2. The different complementation algorithms can focus only on the respective blocks 
and do not need to consider other parts of the BA. This is advantageous, e.g., for 
rank-based algorithms, which can use this restriction to obtain tighter bounds on the 
considered ranks (even tighter than using the refinement in [29]). 

3. The obtained automaton can be more compact due to the use of a more general
acceptance condition than Büchi [47]. In general, it can be a conjunction of any EL
conditions (one condition for each partition block), depending on the output of the
complementation procedures; using such a mixture of conditions can allow a more
compact encoding of the produced automaton. E.g., a deterministic BA can be
complemented with constant extra generated states when using a co-Büchi condition
rather than a linear number of generated states for a Büchi condition (see Sec. 5.1).


Those partial complementation algorithms then need to be orchestrated by a top-level 
algorithm to produce the complement of A. 
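
The more elaborate partitioning mentioned above (all accepting IWCs in one block, all DACs in another, each NAC in its own block) can be sketched as follows. The input format, a list of already classified SCCs, and the textual algorithm labels are assumptions made for this example.

```python
# Candidate partial procedures per block kind, as named in the text.
SUGGESTED = {
    "IWC": "breakpoint (Miyano-Hayashi) construction",
    "DAC": "adaptation of the NCSB construction",
    "NAC": "rank-based or determinization-based construction",
}

def partitioning(classified):
    """classified: list of (scc, kind, accepting), kind in {"IWC", "DAC", "NAC"}.

    Returns a list of (block, kind); non-accepting SCCs belong to no block.
    """
    accepting = [(C, kind) for C, kind, acc in classified if acc]
    blocks = []
    for kind in ("IWC", "DAC"):
        merged = frozenset().union(*[C for C, k in accepting if k == kind])
        if merged:                               # blocks must be non-empty
            blocks.append((merged, kind))
    blocks.extend((C, k) for C, k in accepting if k == "NAC")
    return blocks

classified = [
    (frozenset({0}), "IWC", False),
    (frozenset({1}), "IWC", True),
    (frozenset({2}), "DAC", True),
    (frozenset({5, 6}), "IWC", True),
    (frozenset({3, 4}), "NAC", True),
    (frozenset({7}), "NAC", True),
]
blocks = partitioning(classified)
plan = [(sorted(block), SUGGESTED[kind]) for block, kind in blocks]
```

The trivial partitioning, with one block per accepting SCC, is obtained by skipping the merging step.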

One might regard our algorithm as an optimization of an approach that would for
each partition block $P$ obtain a BA $\mathcal{A}_P$, complement $\mathcal{A}_P$ using the selected algorithm,
and perform the intersection of all the complements obtained in this way (which would,
however, not be able to get the upper bound for elevator automata that we give in Sec. 4.3). Indeed, we also
implemented the mentioned procedure (called the postponed approach, described in 
Sec. 5.2) and compared it to our main procedure (called the synchronous approach). 


3.1 Basic Synchronous Algorithm 

In this section, we describe the basic synchronous top-level algorithm. Then, in Sec. 4, 
we provide its instantiation for elevator automata and give a new upper bound for their 
complementation; in Sec. 5, we discuss several optimizations of the algorithm; and in 
Sec. 6, we give a generalization for unrestricted BAs. Let us fix a BA $\mathcal{A} = (Q, \delta, I, F)$
and, w.l.o.g., assume that $\mathcal{A}$ is complete, i.e., $|I| > 0$ and all states $q \in Q$ have an
outgoing transition over all symbols $a \in \Sigma$.

The synchronous algorithm works with partial complementation algorithms for BA’s
partition blocks. Each such algorithm $\mathit{Alg}$ is provided with a structural condition $\varphi_{\mathit{Alg}}$
characterizing partition blocks it can complement. For a BA $\mathcal{B}$, we use the notation $\mathcal{B} \models \varphi$
to denote that $\mathcal{B}$ satisfies the condition $\varphi$. We say that $\mathit{Alg}$ is a partial complementation
algorithm for a partition block $P$ if $\mathcal{A}_P \models \varphi_{\mathit{Alg}}$. We distinguish between $\mathit{Alg}$, a general
algorithm able to complement a partition block of a given type, and $\mathit{Alg}_P$, its instantiation
for the partition block $P$. Each instance $\mathit{Alg}_P$ is required to provide the following:


- $T^{\mathit{Alg}_P}$: the type of the macrostates produced by the algorithm;

- $\mathsf{Colours}^{\mathit{Alg}_P} = \{0, \ldots, k^{\mathit{Alg}_P} - 1\}$: the set of used colours;

- $\mathsf{Init}^{\mathit{Alg}_P} \in 2^{T^{\mathit{Alg}_P}}$: the set of initial macrostates;

- $\mathsf{Succ}^{\mathit{Alg}_P}\colon (2^Q \times T^{\mathit{Alg}_P} \times \Sigma) \to 2^{T^{\mathit{Alg}_P} \times 2^{\mathsf{Colours}^{\mathit{Alg}_P}}}$: a function returning the
successors of a macrostate such that $\mathsf{Succ}^{\mathit{Alg}_P}(H, M, a) = \{(M_1, \alpha_1), \ldots, (M_k, \alpha_k)\}$,
where $H$ is the set of all states of $\mathcal{A}$ reached over the same word, $M$ is the $\mathit{Alg}_P$’s
macrostate for the given partition block, $a$ is the input symbol, and each $(M_i, \alpha_i)$ is
a pair (macrostate, set of colours) such that $M_i$ is a successor of $M$ over $a$ w.r.t. $H$
and $\alpha_i$ is a set of colours on the edge from $M$ to $M_i$ ($H$ helps to keep track of new
runs coming into the partition block); and

- $\mathsf{Acc}^{\mathit{Alg}_P} \in \mathsf{EL}(\mathsf{Colours}^{\mathit{Alg}_P})$: the acceptance condition.

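Let $P_1, \ldots, P_n$ be a partitioning of $\mathcal{A}$ (w.l.o.g., we assume that $n > 0$), and
$\mathrm{Alg}^1, \ldots, \mathrm{Alg}^n$ be a sequence of algorithms such that $\mathrm{Alg}^i$ is a partial
complementation algorithm for $P_i$. Furthermore, let us define the following auxiliary renumbering
function $\lambda$ as $\lambda(c, j) = c + \sum_{i=1}^{j-1} |\mathrm{Colours}^{\mathrm{Alg}^i_{P_i}}|$, which is used to make the colours and
acceptance conditions from the partial complementation algorithms disjoint. We also
lift $\lambda$ to sets of colours in the natural way, and also to EL conditions such that $\lambda(\varphi, j)$ has
the same structure as $\varphi$ but each atom $\mathrm{Inf}(c)$ is substituted with the atom $\mathrm{Inf}(\lambda(c, j))$ (and
likewise for Fin atoms). The synchronous complementation algorithm then produces
the TELA $\mathrm{ModCompl}(\mathrm{Alg}^1_{P_1}, \ldots, \mathrm{Alg}^n_{P_n}, \mathcal{A}) = (Q^C, \delta^C, I^C, \Gamma^C, p^C, \mathrm{Acc}^C)$ with
components defined as follows (we use $[S_i]_{i=1}^{n}$ to abbreviate $S_1 \times \cdots \times S_n$):

- $Q^C = 2^Q \times [T^{\mathrm{Alg}^i_{P_i}}]_{i=1}^{n}$,
- $I^C = \{I\} \times [\mathrm{Init}^{\mathrm{Alg}^i_{P_i}}]_{i=1}^{n}$,
- $\Gamma^C = \{0, \ldots, \lambda(k^{\mathrm{Alg}^n_{P_n}} - 1, n)\}$,
- $\mathrm{Acc}^C = \bigwedge_{i=1}^{n} \lambda(\mathrm{Acc}^{\mathrm{Alg}^i_{P_i}}, i)$,⁸ and
- $\delta^C$ and $p^C$ are defined such that if
  $((M'_1, \alpha_1), \ldots, (M'_n, \alpha_n)) \in [\mathrm{Succ}^{\mathrm{Alg}^i_{P_i}}(H, M_i, a)]_{i=1}^{n}$,
  then $\delta^C$ contains the transition $t\colon (H, M_1, \ldots, M_n) \xrightarrow{a} (\delta(H, a), M'_1, \ldots, M'_n)$,
  coloured by $p^C(t) = \bigcup \{\lambda(\alpha_i, i) \mid 1 \le i \le n\}$, and $\delta^C$ is the smallest such set.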
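
The product step defined above can be illustrated with a minimal Python sketch. It is not the implementation used in the paper: the names `renumber` and `sync_successors`, the dictionary encoding of $\delta$, and the representation of each partial successor function as a Python callable are all assumptions made for illustration only.

```python
from itertools import product


def renumber(c, j, colour_counts):
    """lambda(c, j): shift colour c of the j-th algorithm (1-based) past
    the colours used by algorithms 1..j-1."""
    return c + sum(colour_counts[: j - 1])


def sync_successors(H, Ms, a, delta, succs, colour_counts):
    """One step of the synchronous product (illustrative encoding).

    H             -- frozenset of states reached so far
    Ms            -- tuple of partial macrostates, one per partition block
    delta         -- dict mapping (state, symbol) to a set of states
    succs[i]      -- callable (H, M, a) -> set of (M', frozenset_of_colours)
    colour_counts -- number of colours of each partial algorithm
    """
    H_next = frozenset(q2 for q in H for q2 in delta.get((q, a), ()))
    partial = [succs[i](H, Ms[i], a) for i in range(len(Ms))]
    result = set()
    # Cartesian product of the partial successors; it is empty as soon as
    # one partial algorithm returns no successor.
    for combo in product(*partial):
        Ms_next = tuple(M for (M, _) in combo)
        colours = frozenset(
            renumber(c, i + 1, colour_counts)
            for i, (_, alpha) in enumerate(combo)
            for c in alpha
        )
        result.add(((H_next, Ms_next), colours))
    return result
```

Because the successors are combined by a Cartesian product, a single partial algorithm without a successor makes the whole top-level macrostate a dead end.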


In order for ModCompl to be correct, the partial complementation algorithms need to
satisfy certain properties, which we discuss below.

For a structural condition $\varphi$ and a BA $\mathcal{B} = (Q, \delta, I, F)$, we define $\mathcal{B} \models_P \varphi$ if $\mathcal{B} \models \varphi$,
$P$ is a partition block of $\mathcal{B}$, and $\mathcal{B}$ contains no accepting transitions outside $P$. We can
now provide the correctness condition on Alg.


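Definition 1. We say that Alg is correct if for each BA $\mathcal{B}$ and partition block $P$ such
that $\mathcal{B} \models_P \varphi_{\mathrm{Alg}}$ it holds that $\mathcal{L}(\mathrm{ModCompl}(\mathrm{Alg}_P, \mathcal{B})) = \Sigma^\omega \setminus \mathcal{L}(\mathcal{B})$.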


The correctness of the synchronous algorithm (provided that each partial comple- 
mentation algorithm is correct) is then established by Theorem 1. 


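Theorem 1. Let $\mathcal{A}$ be a BA, $P_1, \ldots, P_n$ be a partitioning of $\mathcal{A}$, and $\mathrm{Alg}^1, \ldots, \mathrm{Alg}^n$
be a sequence of partial complementation algorithms such that $\mathrm{Alg}^i$ is correct for $P_i$.
Then, we have $\mathcal{L}(\mathrm{ModCompl}(\mathrm{Alg}^1_{P_1}, \ldots, \mathrm{Alg}^n_{P_n}, \mathcal{A})) = \Sigma^\omega \setminus \mathcal{L}(\mathcal{A})$.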


4 Modular Complementation of Elevator Automata 


In this section, we first give partial algorithms to complement partition blocks with 
only accepting IWCs (Sec. 4.1) and partition blocks with only DACs (Sec. 4.2). Then, 
in Sec. 4.3, we show that using our algorithm, the upper bound on the size of the
complement of elevator BAs is in $O(4^n)$, which is exponentially better than the known
upper bound $O(16^n)$ established in [29].
8 If we drop the condition that $\mathcal{A}$ is complete, we also need to add an accepting sink state
(representing the case for $H = \emptyset$) with self-loops over all symbols marked by a new colour $c$,
and enrich $\mathrm{Acc}^C$ with $\ldots \lor \mathrm{Inf}(c)$.
4.1 Complementation of Inherently Weak Accepting Components


First, we introduce a partial algorithm MH with the condition $\varphi_{\mathrm{MH}}$ specifying that all SCCs
in the partition block $P$ are accepting IWCs. Let $P$ be a partition block of $\mathcal{A}$ such that
$\mathcal{A}_P \models \varphi_{\mathrm{MH}}$. Our proposed approach makes use of the Miyano-Hayashi construction [42].
Since in accepting IWCs, all runs are accepting, the idea of the construction is to accept
words such that all runs over the words eventually leave $P$.

Therefore, we use a pair $(C, B)$ of sets of states as a macrostate for complementing $P$.
Intuitively, we use $C$ to denote the set of all runs of $\mathcal{A}$ that are in $P$ ($C$ for "check"). The
set $B \subseteq C$ represents the runs being inspected whether they leave $P$ at some point ($B$ for
"breakpoint"). Initially, we let $C = I \cap P$ and also sample into the breakpoint all runs in $P$,
i.e., set $B = C$. While reading an ω-word $w$, if all runs that have entered $P$ eventually
leave $P$, i.e., $B$ becomes empty infinitely often, the complement language of $P$ should
contain $w$ (when $B$ becomes empty, we sample $B$ with all runs from the current $C$). We
formalize $\mathrm{MH}_P$ as a partial procedure in the framework from Sec. 3.1 as follows:


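- $T^{\mathrm{MH}_P} = 2^P \times 2^P$, $\mathrm{Colours}^{\mathrm{MH}_P} = \{⓿\}$, $\mathrm{Init}^{\mathrm{MH}_P} = \{(I \cap P, I \cap P)\}$,
- $\mathrm{Acc}^{\mathrm{MH}_P} = \mathrm{Inf}(⓿)$, and $\mathrm{Succ}^{\mathrm{MH}_P}(H, (C, B), a) = \{((C', B'), \alpha)\}$ where
  • $C' = \delta(H, a) \cap P$,
  • $B' = C'$ if $B^\star = \emptyset$ for $B^\star = \delta(B, a) \cap C'$, and $B' = B^\star$ otherwise, and
  • $\alpha = \{⓿\}$ if $B^\star = \emptyset$, and $\alpha = \emptyset$ otherwise.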

We can see that checking whether $w$ is accepted by the complement of $P$ reduces to
checking whether $B$ has been cleared infinitely often. Since every time $B$ becomes
empty, we emit the colour ⓿, we have that $w$ is not accepted by $\mathcal{A}$ within $P$ if and only
if ⓿ occurs infinitely often. Note that the transition function $\mathrm{Succ}^{\mathrm{MH}_P}$ is deterministic,
i.e., there is exactly one successor.
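
As an illustration of the definition of $\mathrm{MH}_P$, the following Python sketch computes the single successor of a macrostate $(C, B)$. The encoding is an assumption made for this example only: $\delta$ is a dictionary from (state, symbol) pairs to sets of states, $P$ is a frozenset, and the single colour is represented by the integer 0.

```python
def post(states, a, delta):
    """delta(U, a): all a-successors of the states in U."""
    return frozenset(q2 for q in states for q2 in delta.get((q, a), ()))


def mh_init(I, P):
    """Initial macrostate (I ∩ P, I ∩ P)."""
    start = frozenset(I) & frozenset(P)
    return {(start, start)}


def mh_succ(H, macro, a, delta, P):
    """Successor of the macrostate (C, B) over symbol a."""
    C, B = macro
    C_next = post(H, a, delta) & P       # all runs currently inside P
    B_star = post(B, a, delta) & C_next  # inspected runs that are still in P
    if B_star:
        return {((C_next, B_star), frozenset())}
    # Breakpoint emptied: resample it from C' and emit colour 0.
    return {((C_next, C_next), frozenset({0}))}
```

The function always returns a singleton set, matching the remark that the successor function of MH is deterministic.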


Lemma 1. The partial algorithm MH is correct. 


4.2 Complementation of Deterministic Accepting Components 
In this section, we give a partial algorithm CSB with the condition $\varphi_{\mathrm{CSB}}$ specifying
that a partition block $P$ consists of DACs. Let $P$ be a partition block of $\mathcal{A}$ such that
$\mathcal{A}_P \models \varphi_{\mathrm{CSB}}$. Our approach is based on the NCSB family of algorithms [6,11,5,28]
for complementing SDBAs, in particular the NCSB-MaxRank construction [28]. The
algorithm utilizes the fact that runs in DACs are deterministic, i.e., they do not branch
into new runs. Therefore, one can check that a run is non-accepting if there is a time
point from which the run does not see accepting transitions any more. We call such
a run that does not see accepting transitions any more safe. Then, an ω-word $w$ is not
accepted in $P$ iff all runs over $w$ in $P$ either (i) leave $P$ or (ii) eventually become safe.
For checking point (i), we can use a similar technique as in algorithm MH, i.e., use
a pair $(C, B)$. Moreover, to be able to check point (ii), we also use the set $S$ that contains
runs that are supposed to be safe, resulting in macrostates of the form $(C, S, B)$.⁹ To
make sure that all runs are deterministic, we will use $\delta_{\mathrm{SCC}}$ instead of $\delta$ when computing
the successors of $S$ and $B$ since there may be nondeterministic jumps between different
DACs in $P$; we will not miss any run in $P$ since if a run moves between DACs of $P$, it


9 In contrast to MH, here we use $C \cup S$ rather than $C$ to keep track of all runs in $P$.

Fig. 1: Left: BA $\mathcal{A}_{\mathrm{ex}}$ (dots represent accepting transitions). Right: the outcome
of $\mathrm{ModCompl}(\mathrm{CSB}_{P_0}, \mathrm{MH}_{P_1}, \mathcal{A}_{\mathrm{ex}})$ with Acc: $\mathrm{Inf}(⓿) \land \mathrm{Inf}(❶)$. States are given as
$(H, (C_0, S_0, B_0), (C_1, B_1))$; to avoid too many braces, sets are given as sums.


can be seen as the run leaving P and a new run entering P. Since a run eventually stays 
in one SCC, this guarantees that the run will not be missed. 
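We formalize $\mathrm{CSB}_P$ in the top-level framework as follows:

- $T^{\mathrm{CSB}_P} = 2^P \times 2^P \times 2^P$, $\mathrm{Init}^{\mathrm{CSB}_P} = \{(I \cap P, \emptyset, I \cap P)\}$,
- $\mathrm{Colours}^{\mathrm{CSB}_P} = \{⓿\}$, $\mathrm{Acc}^{\mathrm{CSB}_P} = \mathrm{Inf}(⓿)$, and
- $\mathrm{Succ}^{\mathrm{CSB}_P}(H, (C, S, B), a) = U$ such that
  • if $\delta_F(S, a) \neq \emptyset$, then $U = \emptyset$ (runs in $S$ must be safe),
  • otherwise $U$ contains $((C', S', B'), \alpha)$ where
    * $S' = \delta_{\mathrm{SCC}}(S, a) \cap P$, $C' = (\delta(H, a) \cap P) \setminus S'$,
    * $B' = C'$ if $B^\star = \emptyset$ for $B^\star = \delta_{\mathrm{SCC}}(B, a)$, and $B' = B^\star$ otherwise, and
    * $\alpha = \{⓿\}$ if $B^\star = \emptyset$, and $\alpha = \emptyset$ otherwise.
    Moreover, in the case $\delta_F(B, a) = \emptyset$, then $U$ also contains $((C'', S'', C''), \{⓿\})$
    where $S'' = S' \cup B'$ and $C'' = C' \setminus S''$.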


Intuitively, when $\delta_F(B, a) \cap \delta_{\mathrm{SCC}}(B, a) = \emptyset$, we make the following guess: (i) either the
runs in $B$ all become safe (we move them to $S$) or (ii) there might be some unsafe runs
(we keep them in $B$). Since the runs in $B$ are deterministic, the number of tracked runs
in $B$ will not increase. Moreover, if all runs in $B$ are eventually safe, we are guaranteed
to move all of them to $S$ at the right time point, e.g., the maximal time point where all
runs are safe since the number of runs is finite.

As mentioned above, $w$ is not accepted within $P$ iff all runs over $w$ either (i) leave $P$
or (ii) become safe. In the context of the presented algorithm, this corresponds to
(i) $B$ becoming empty infinitely often and (ii) $\delta_F(S, a)$ never seeing an accepting
transition. Then we only need to check if there exists an infinite sequence of macrostates
$(C_0, S_0, B_0) \ldots$ that emits ⓿ infinitely often.
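
The successor function of $\mathrm{CSB}_P$ can be sketched in Python as follows. This is an illustrative reading of the definition above, not the authors' code: the encoding of $\delta$ as a dictionary, of accepting transitions as a set `acc` of triples, and of SCC membership as a dictionary `scc_of` (assumed to cover every state) are hypothetical choices, and the colour is again the integer 0.

```python
def post(states, a, delta):
    """delta(U, a): all a-successors of the states in U."""
    return frozenset(q2 for q in states for q2 in delta.get((q, a), ()))


def post_scc(states, a, delta, scc_of):
    """delta_SCC(U, a): a-successors that stay in the SCC of their source."""
    return frozenset(
        q2 for q in states for q2 in delta.get((q, a), ())
        if scc_of[q] == scc_of[q2]
    )


def post_acc(states, a, delta, acc):
    """delta_F(U, a): targets of accepting a-transitions leaving U."""
    return frozenset(
        q2 for q in states for q2 in delta.get((q, a), ())
        if (q, a, q2) in acc
    )


def csb_succ(H, macro, a, delta, acc, scc_of, P):
    """Successors of the macrostate (C, S, B) over symbol a."""
    C, S, B = macro
    if post_acc(S, a, delta, acc):
        return set()  # a run guessed to be safe saw an accepting transition
    S1 = post_scc(S, a, delta, scc_of) & P
    C1 = (post(H, a, delta) & P) - S1
    B_star = post_scc(B, a, delta, scc_of)
    if B_star:
        B1, alpha = B_star, frozenset()
    else:
        B1, alpha = C1, frozenset({0})
    U = {((C1, S1, B1), alpha)}
    if not post_acc(B, a, delta, acc):
        # Guess that all runs in the breakpoint are safe from now on.
        S2 = S1 | B1
        C2 = C1 - S2
        U.add(((C2, S2, C2), frozenset({0})))
    return U
```

Unlike MH, this function may return zero, one, or two successors, which is the source of nondeterminism that the round-robin optimization of Sec. 5.3 targets.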


Lemma 2. The partial algorithm CSB is correct. 


It is worth noting that when the given partition block P contains all DACs of A, we 
can still use the construction above, while the construction in [28] only works on SDBAs. 
Example 1. In Fig. 1, we give an example of the run of our algorithm on the BA $\mathcal{A}_{\mathrm{ex}}$. The
BA contains three SCCs, one of them (the one containing $p$) non-accepting (therefore,
it does not need to occur in any partition block). The partition block $P_0$ contains a single
DAC, so we can use algorithm CSB, and the partition block $P_1$ contains a single accepting
IWC, so we can use MH. The resulting $\mathrm{ModCompl}(\mathrm{CSB}_{P_0}, \mathrm{MH}_{P_1}, \mathcal{A}_{\mathrm{ex}})$ uses two colours,
⓿ from CSB and ❶ from MH. The acceptance condition is $\mathrm{Inf}(⓿) \land \mathrm{Inf}(❶)$. □


4.3 Upper-bound for Elevator Automata Complementation 


We now give an upper bound on the size of the complement generated by our algorithm
for elevator automata, which significantly improves the best previously known
upper bound of $O(16^n)$ [29] to $O(4^n)$, the same as for SDBAs, which are a strict
subclass of elevator automata [6] (we note that this upper bound cannot be obtained by
a determinization-based algorithm, since determinization of SDBAs is in $\Omega(n!)$ [17,40]).


Theorem 2. Let $\mathcal{A}$ be an elevator automaton with $n$ states. Then there exists a BA
with $O(4^n)$ states accepting the complement of $\mathcal{L}(\mathcal{A})$.


Proof (Sketch). Let $Q_W$ be all states in accepting IWCs, $Q_D$ be all states in DACs, and
$Q_N$ be the remaining states, i.e., $Q = Q_W \uplus Q_D \uplus Q_N$. We make two partition blocks:
$P_0 = Q_W$ and $P_1 = Q_D$ and use MH and CSB respectively as the partial algorithms, with
macrostates of the form $(H, (C_0, B_0), (C_1, S_1, B_1))$. For each state $q_N \in Q_N$, there are
two options: either $q_N \notin H$ or $q_N \in H$. For each state $q_W \in Q_W$, there are three options:
(i) $q_W \notin C_0$, (ii) $q_W \in C_0 \setminus B_0$, or (iii) $q_W \in C_0 \cap B_0$. Finally, for each $q_D \in Q_D$, there
are four options: (i) $q_D \notin C_1 \cup S_1$, (ii) $q_D \in S_1$, (iii) $q_D \in C_1 \setminus B_1$, or (iv) $q_D \in C_1 \cap B_1$.
Therefore, the total number of macrostates is $2 \cdot 2^{|Q_N|} \cdot 3^{|Q_W|} \cdot 4^{|Q_D|} \in O(4^n)$ where
the initial factor 2 is due to degeneralization from two to one colour (the two colours
can actually be avoided by using our shared breakpoint optimization from Sec. 5.4). □
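
The counting argument can be checked numerically. The short Python sketch below (the function name `macrostate_bound` is ours, chosen for illustration) evaluates $2 \cdot 2^{|Q_N|} \cdot 3^{|Q_W|} \cdot 4^{|Q_D|}$ for every split of $n$ states into the three classes and confirms that it never exceeds $2 \cdot 4^n$, since each state contributes a factor of at most 4.

```python
def macrostate_bound(n_N, n_W, n_D):
    """Number of macrostates from the proof sketch: 2 * 2^|Q_N| * 3^|Q_W| * 4^|Q_D|."""
    return 2 * 2**n_N * 3**n_W * 4**n_D


n = 10
# Enumerate all ways to split n states into Q_N, Q_W, and Q_D.
worst = max(
    macrostate_bound(n - w - d, w, d)
    for w in range(n + 1)
    for d in range(n + 1 - w)
)
# The maximum is attained when every state lies in a DAC.
assert worst == 2 * 4**n
```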


5 Optimizations of the Modular Construction 


In this section, we propose optimizations of the basic modular algorithm. In Sec. 5.1, 
we give a partial algorithm to complement initial partition blocks with DACs. Further, 
in Sec. 5.2, we propose the postponed construction allowing to use automata reduction 
on intermediate results. In Sec. 5.3, we propose the round-robin algorithm alleviating 
the problem with the explosion of the size of the Cartesian product of partial successors. 
In Sec. 5.4, we provide an optimization for partial algorithms that are based on the 
breakpoint construction, and, finally, in Sec. 5.5, we show how to employ simulation to 
decrease the size of macrostates in the synchronous construction. 


5.1 Complementation of Initial Deterministic Partition Blocks 


Our first optimization is an algorithm CoB for a subclass of partition blocks containing
DACs. In particular, the condition $\varphi_{\mathrm{CoB}}$ specifies that the partition block $P$ is deterministic
and can be reached only deterministically in $\mathcal{A}$ (i.e., $\mathcal{A}_P$ after removing redundant states
is deterministic). Then, we say that $P$ is an initial deterministic partition block. The
algorithm is based on complementation of deterministic BAs into co-Büchi automata.
The algorithm $\mathrm{CoB}_P$ is formalized below:


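- $T^{\mathrm{CoB}_P} = P \cup \{\emptyset\}$, $\mathrm{Init}^{\mathrm{CoB}_P} = I \cap P$, $\mathrm{Colours}^{\mathrm{CoB}_P} = \{⓿\}$, $\mathrm{Acc}^{\mathrm{CoB}_P} = \mathrm{Fin}(⓿)$,
- $\mathrm{Succ}^{\mathrm{CoB}_P}(H, q, a) = \{(q', \alpha)\}$ where
  • $q' = r$ if $\delta(H, a) \cap P = \{r\}$, and $q' = \emptyset$ otherwise, and
  • $\alpha = \{⓿\}$ if $q \xrightarrow{a} q' \in F$, and $\alpha = \emptyset$ otherwise.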


Intuitively, all runs reach $P$ deterministically, which means that over a word $w$, at
most one run can reach $P$ (so $|\mathrm{Init}^{\mathrm{CoB}_P}| = 1$). Thus, we have $|\delta(H, w_j) \cap P| = 1$ for
some $j > 0$ if there is a run over $w$ to $P$, corresponding to $\delta(H, a) \cap P = \{r\}$ in the
construction. To check whether $w$ is not accepted in $P$, we only need to check whether the
run from $r \in P$ over $w$ visits accepting transitions only finitely often. We give an example
of complementation of a BA containing an initial deterministic partition block in [27].
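
A minimal Python sketch of the CoB successor function follows. The encoding is assumed for illustration only: the empty macrostate $\emptyset$ is represented by `None`, accepting transitions by a set `acc` of triples, and the single colour (interpreted under the Fin condition) by the integer 0.

```python
EMPTY = None  # stands for the macrostate ∅ (no run inside P)


def post(states, a, delta):
    """delta(U, a): all a-successors of the states in U."""
    return frozenset(q2 for q in states for q2 in delta.get((q, a), ()))


def cob_succ(H, q, a, delta, acc, P):
    """Successor of the CoB macrostate q (a state of P or EMPTY) over a."""
    reach = post(H, a, delta) & P
    # At most one run may be inside P; track it if it exists.
    q_next = next(iter(reach)) if len(reach) == 1 else EMPTY
    # Emit the colour exactly when the tracked run takes an accepting transition.
    alpha = frozenset({0}) if (q, a, q_next) in acc else frozenset()
    return {(q_next, alpha)}
```

Since the acceptance condition is Fin(0), a word is accepted by this component exactly when the tracked run emits colour 0 only finitely often.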


Lemma 3. The partial algorithm CoB is correct. 


5.2 Postponed Construction 


The modular synchronous construction from Sec. 3.1 utilizes the assumption that in the 
simultaneous construction of successors for each partition block over a, if one partial 
macrostate $M_i$ does not have a successor over $a$, then there will be no successor of the
$(H, M_1, \ldots, M_n)$ macrostate in $\delta^C$ as well. This is useful, e.g., for inclusion testing,
where it is not necessary to generate the whole complement. On the other hand, if we 
need to generate the whole automaton, a drawback of the proposed modular construction 
is that each partial complementation algorithm itself may generate a lot of useless states. 
In this section, we propose the postponed construction, which complements the partition 
blocks (with their surrounding) independently and later combines the intermediate 
results to obtain the complement automaton for A. The main advantage of the postponed 
construction is that one can apply automata reduction (e.g., based on removing useless 
states or using simulation [13,18,1,9]) to decrease the size of the intermediate automata. 

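In the postponed construction, we use the product-based BA intersection operation (i.e.,
for two TELAs $\mathcal{B}_1$ and $\mathcal{B}_2$, a product automaton $\mathcal{B}_1 \cap \mathcal{B}_2$ satisfying
$\mathcal{L}(\mathcal{B}_1 \cap \mathcal{B}_2) = \mathcal{L}(\mathcal{B}_1) \cap \mathcal{L}(\mathcal{B}_2)$).¹⁰ Further, we employ a function Red performing some
language-preserving reduction of an input TELA. Then, the postponed construction for an elevator
automaton $\mathcal{A}$ with a partitioning $P_1, \ldots, P_n$ and a sequence $\mathrm{Alg}^1, \ldots, \mathrm{Alg}^n$, where $\mathrm{Alg}^i$
is a partial complementation algorithm for $P_i$, is defined as follows:

$$\mathrm{PostpCompl}(\mathrm{Alg}^1_{P_1}, \ldots, \mathrm{Alg}^n_{P_n}, \mathcal{A}) = \bigcap_{i=1}^{n} \mathrm{Red}\bigl(\mathrm{ModCompl}(\mathrm{Alg}^i_{P_i}, \mathcal{A}_{P_i})\bigr). \tag{2}$$
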
The correctness of the construction is then summarized by the following theorem. 


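Theorem 3. Let $\mathcal{A}$ be a BA, $P_1, \ldots, P_n$ be a partitioning of $\mathcal{A}$, and $\mathrm{Alg}^1, \ldots, \mathrm{Alg}^n$
be a sequence of partial complementation algorithms such that $\mathrm{Alg}^i$ is correct for $P_i$.
Then, $\mathcal{L}(\mathrm{PostpCompl}(\mathrm{Alg}^1_{P_1}, \ldots, \mathrm{Alg}^n_{P_n}, \mathcal{A})) = \Sigma^\omega \setminus \mathcal{L}(\mathcal{A})$.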
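
The shape of the postponed construction in Eq. (2) is a fold of an intersection over independently complemented and reduced blocks. The Python sketch below shows only this control structure; the function name `postponed_complement` and its callable parameters are hypothetical, and the usage example replaces automata by finite sets standing in for languages, which is a deliberate simplification and not how TELAs are represented.

```python
from functools import reduce


def postponed_complement(blocks, mod_compl, red, intersect):
    """Complement each block on its own, reduce it, then intersect.

    blocks    -- non-empty list of (algorithm, block automaton) pairs
    mod_compl -- callable complementing one block with one partial algorithm
    red       -- language-preserving reduction applied to each result
    intersect -- binary product/intersection of two intermediate results
    """
    parts = [red(mod_compl(alg, block)) for alg, block in blocks]
    return reduce(intersect, parts)


# Toy usage with finite sets in place of omega-languages.
universe = frozenset(range(1, 7))
blocks = [("alg1", frozenset({1, 2})), ("alg2", frozenset({3}))]
result = postponed_complement(
    blocks,
    mod_compl=lambda alg, lang: universe - lang,
    red=lambda x: x,
    intersect=lambda x, y: x & y,
)
```

In the toy usage, `result` equals the complement of the union of the two block languages, mirroring how the intersection of per-block complements yields the complement of the whole automaton.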


5.3 Round-Robin Algorithm 


The proposed basic synchronous approach from Sec. 3.1 may suffer from the combinato- 
rial explosion because the successors of a macrostate are given by the Cartesian product 
of all successors of the partial macrostates. To alleviate this explosion, we propose 


10 Alternatively, one might also avoid the product and generate linear-sized alternating TELA, 
but working with those is usually much harder and not used in practice. 
a round-robin top-level algorithm. Intuitively, the round-robin algorithm actively tracks
runs in only one partial complementation algorithm at a time (while other algorithms 
stay passive). The algorithm periodically changes the active algorithm to avoid starvation 
(the decision to leave the active state is, however, fully directed by the partial comple- 
mentation algorithm). This can alleviate an explosion in the number of successors for 
algorithms that generate more than one successor (e.g., for rank-based algorithms where 
one needs to make a nondeterministic choice of decreasing ranks of states in order to be 
able to accept [34,21,48,10,24,29]; such a choice needs to be made only in the active 
phase while in the passive phase, the construction just needs to make sure that the run 
is consistent with the given ranking, which can be done deterministically). 

The round-robin algorithm works on the level of partial complementation round- 
robin algorithms. Each instance of the partial algorithm provides passive types to rep- 
resent partial macrostates that are passive and active types to represent currently active 
partial macrostates. In contrast to the basic partial complementation algorithms from 
Sec. 3.1, which provide only a single successor function, the round-robin partial al- 
gorithms provide several variants of them. In particular, SuccPass returns (passive) 
successors of a passive partial macrostate, Lift gives all possible active counterparts 
of a passive macrostate, and SuccAct returns successors of an active partial macrostate. 
If SuccAct returns a partial macrostate of the passive type, the round-robin algorithm 
promotes the next partial algorithm to be the active one. For instance, in the round-robin 
version of CSB, the passive type does not contain the breakpoint and only checks that 
safe runs stay safe, so it is deterministic. Due to space limitations, we give a formal 
definition and more details about the round-robin algorithm in [27]. 


5.4 Shared Breakpoint 


The partial complementation algorithms CSB and MH (and later RNK defined in Sec. 6) 
use a breakpoint to check whether the runs under inspection are accepting or not. As 
an optimization, we consider merging of breakpoints of several algorithms and keeping 
only a single breakpoint for all supported algorithms. The top-level algorithm then needs 
to manage only one breakpoint and emit a colour only if this sole breakpoint becomes 
empty. This may lead to a smaller number of generated macrostates since we synchronize 
the breakpoint sampling among several algorithms. The second benefit is that this allows 
us to generate fewer colours (in the case of elevator automata complemented using 
algorithms CSB and MH, we get only one colour). 


5.5 Simulation Pruning 


Our construction can be further optimized by a simulation (or other compatible) relation
for pruning macrostates.¹¹ A simulation is, broadly speaking, a relation $\preceq\ \subseteq Q \times Q$
implying language inclusion of states, i.e., $\forall p, q \in Q\colon p \preceq q \Rightarrow \mathcal{L}(\mathcal{A}[p]) \subseteq \mathcal{L}(\mathcal{A}[q])$.
Intuitively, our optimization allows removing a state $p$ from a macrostate $M$
if there is also a state $q$ in $M$ such that (i) $p \preceq q$, (ii) $p$ is not reachable from $q$, and
(iii) $p$ is smaller than $q$ in an arbitrary total order over $Q$ (this serves as a tie-breaker for


11 This optimization can be seen as a generalization of the simulation-based pruning techniques
that appeared, e.g., in [41,28] in the context of concrete determinization/complementation
procedures. Here, we generalize the technique to all procedures that are based on run tracking.
simulation-equivalent mutually unreachable states). The reason why $p$ can be removed
is that its behaviour can be completely mimicked by $q$. In our construction, we can then,
roughly speaking, replace each call to the functions $\delta(U, a)$ and $\delta_F(U, a)$, for a set of
states $U$, by $\mathrm{pr}(\delta(U, a))$ and $\mathrm{pr}(\delta_F(U, a))$ respectively in each partial complementation
algorithm, as well as in the top-level algorithm, where $\mathrm{pr}(S)$ is obtained from $S$ by
pruning all eligible states. The details are provided in [27].
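
A literal reading of the three pruning conditions can be sketched in Python as follows. The precise definition is deferred to [27], so this is only an approximation under assumed encodings: `sim` is a set of pairs $(p, q)$ with $p \preceq q$, `reach` is a set of pairs $(q, p)$ meaning that $p$ is reachable from $q$, and `order` maps each state to its position in the chosen total order.

```python
def prune(S, sim, reach, order):
    """pr(S): drop every state p for which some q in S satisfies
    (i) p is simulated by q, (ii) p is not reachable from q, and
    (iii) p precedes q in the total order."""
    S = frozenset(S)

    def removable(p):
        return any(
            q != p
            and (p, q) in sim
            and (q, p) not in reach
            and order[p] < order[q]
            for q in S
        )

    return frozenset(p for p in S if not removable(p))
```

Condition (iii) guarantees that of two simulation-equivalent, mutually unreachable states exactly one survives, so a run is never dropped together with the state that mimics it.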


6 Modular Complementation of Non-Elevator Automata 


A non-elevator automaton A contains at least one NAC, besides possibly other IWCs 
or DACs. To complement A in a modular way, we apply the techniques seen in Sec. 4 
to its DACs and IWCs, while for its NACs we resort to a general complementation 
algorithm Alg. In theory, rank- [34], slice- [32], Ramsey- [50], subset-tuple- [2], and 
determinization- [46] based complementation algorithms adapted to work on a single 
partition block instead of the whole automaton are all valid instantiations of Alg. Below, 
we give a high-level description of two such algorithms: rank- and determinization-based. 


Rank-based partial complementation algorithm. Working on each NAC independently
benefits the complementation algorithm even if the input BA contains only NACs. For
instance, in rank-based algorithms [34,21,48,33,10,24,29], whether all runs
of $\mathcal{A}$ over a given ω-word $w$ are non-accepting is determined by ranks of states,
given by the so-called ranking functions. A ranking function is a (partial) function
from $Q$ to $\omega$. The main idea of rank-based algorithms is the following: (i) every run is
initially nondeterministically assigned a rank, (ii) ranks can only decrease along a run,
(iii) ranks need to be even every time a run visits an accepting transition, and (iv) the
complement automaton accepts iff all runs eventually get trapped in odd ranks.¹² In the
standard rank-based procedure, the initial assignment of ranks to states in (i) is a function
$Q \to \{0, \ldots, 2n - 1\}$ for $n = |Q|$. Using our framework, we can, however, significantly
restrict the considered ranks in a partition block $P$ to only $P \to \{0, \ldots, 2m - 1\}$ for
$m = |P|$ (here, it makes sense to use partition blocks consisting of single SCCs). One
can further reduce the considered ranks using the techniques introduced in, e.g., [24,29].
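
To give a feeling for the effect of restricting ranks to a partition block, the following Python sketch compares the number of total functions $Q \to \{0, \ldots, 2n - 1\}$ with the number of combinations of per-block functions $P \to \{0, \ldots, 2m - 1\}$. It is a rough illustration under simplifying assumptions: it counts all total functions (ignoring partial functions, the box state introduced below, and the tighter bounds of [24,29]), and the function names are ours.

```python
from math import prod


def global_rank_assignments(n):
    """Number of total functions Q -> {0, ..., 2n - 1} for |Q| = n."""
    return (2 * n) ** n


def modular_rank_assignments(block_sizes):
    """Number of combinations of functions P -> {0, ..., 2m - 1},
    one function per partition block of size m."""
    return prod((2 * m) ** m for m in block_sizes)


# Six states, either ranked globally or split into two blocks of three.
whole = global_rank_assignments(6)
split = modular_rank_assignments([3, 3])
```

For six states, the global count is 2,985,984, whereas two blocks of three states give 46,656 combinations.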

In order to adapt the rank-based construction as a partial complementation algorithm
RNK in our framework, we need to extend the ranking functions by a fresh "box state" □
representing states outside the partition block. The ranking function then uses □ to
represent ranks of runs newly coming into the partition block. The box-extension also
requires changing the transition function in a way that □ always represents reachable states from
the outside. We provide the details of the construction, which includes the MaxRank
optimization from [24], in [27].


Determinization-based partial complementation algorithm. In [52,29] we can see that
determinization-based complementation is also a good instantiation of Alg in practice,
so we also consider the standard Safra-Piterman determinization [46,43,45] as a choice
of Alg for complementing NACs. Determinization-based algorithms use a layered subset
construction to organize all runs over an ω-word $w$. The idea is to identify a subset $S \subseteq H$
of reachable states that occur infinitely often along reading $w$ such that between every two
occurrences of $S$, we have that (i) every state in the second occurrence of $S$ can be reached

12 Since we focus on intuition here, we use runs rather than the directed acyclic graphs of runs. 




Table 1: Statistics for our experiments. The column unsolved classifies unsolved
instances by the form timeouts : out of memory : other failures. For the cases of VBS we
provide just the number of unsolved cases. The columns states and runtime provide
mean : median of the number of states and runtime, respectively.

tool        solved   unsolved        states    runtime
VBS+        39,834   3               78 : 3    0.05 : 0.01
VBS−        39,834   3               96 : 3    0.05 : 0.01
COLA        39,814   21 : 0 : 2      80 : 3    0.17 : 0.02
RANKER      38,837   61 : 939 : 0    45 : 4    3.31 : 0.01
SEMINATOR   39,026   238 : 573 : 0   247 : 3   1.98 : 0.03
Spot        39,827   8 : 0 : 2       160 : 4   0.08 : 0.02

by a state in the first occurrence of $S$ and (ii) every state in the second occurrence is
reached by a state in the first occurrence while seeing an accepting transition. According
to König's lemma, there must then be an accepting run of $\mathcal{A}$ over $w$.

The construction initially maintains only one set $H$: the set of reachable states.
Since $S$ as defined does not necessarily need to be $H$, every time there are runs visiting
accepting transitions, we create a new subset $C$ for those runs and remember which
subset $C$ is coming from. This way, we actually organize the current states of all runs
into a tree structure and do subset construction in parallel for the sets in each tree node.
If we find a tree node whose labelled subset, say $S'$, is equal to the union of states in
its children, we know the set $S'$ satisfies the condition above and we remove all its child
nodes and emit a good event. If such a good event happens infinitely often, it means that
$S'$ also occurs infinitely often. So in complementation, we only need to make sure those
good events happen only finitely many times. Working on each NAC separately also
benefits the determinization-based approach since the number of possible trees is
smaller with a smaller number of reachable states. Following the idea of [37], to adapt
the construction as a partial complementation algorithm, we put all the newly coming
runs from other partition blocks in a newly created node without a parent node. In this
way, we actually maintain a forest of trees for the partial complementation construction.
We denote the determinization-based construction as DET; cf. [37] for details.


7 Experimental Evaluation 


To evaluate the proposed approach, we implemented it in a prototype tool Kofola [25]
(written in C++) built on top of Spot [16] and compared it against COLA [37],
RANKER [28] (v. 2), SEMINATOR [5] (v. 2.0), and Spot [15,16] (v. 2.10.6), which are
the state of the art in BA complementation [29,28,37]. Due to space restrictions, we
give results for only two instantiations of our framework: Kofola_S and Kofola_P. Both
instantiations use MH for IWCs, CSB for DACs, and DET for NACs. The partitioning
selection algorithm merges all IWCs into one partition block, all DACs into one
partition block, and keeps all NACs separate. Simulation-based pruning from Sec. 5.5 is
turned on, and round-robin from Sec. 5.3 is turned off (since the selected algorithms
are quite deterministic). Kofola_S employs the synchronous and Kofola_P employs the
postponed strategy. We also consider the Virtual Best Solver (VBS), i.e., a virtual tool
that would choose the best solver for each single benchmark among all tools (VBS+) and
among all tools except both versions of Kofola (VBS−). We ran our experiments on an
Ubuntu 20.04.4 LTS system running on a desktop machine with 16 GiB RAM and an
Intel 3.6 GHz i7-4790 CPU. To constrain and collect statistics about the executions of
the tools, we used BenchExec [3] and imposed a memory limit of 12 GiB and a timeout
of 10 minutes; we used Spot to cross-validate the equivalence of the automata generated
by the different tools. An artifact reproducing our experiments is available as [26].

Fig. 2: Scatter plots comparing the numbers of states generated by the tools.

As our data set, we used 39,837 BAs from the AUTOMATA-BENCHMARKS reposi- 
tory [36] (used before by, e.g., [29,28,37]), which contains BAs from the following 
sources: (i) randomly generated BAs used in [52] (21,876 BAs), (ii) BAs obtained from 
LTL formulae from the literature and randomly generated LTL formulae [5] (3,442 BAs), 
(iii) BAs obtained from ULTIMATE AUTOMIZER [11] (915 BAs), (iv) BAs obtained from 
the solver for first-order logic over Sturmian words Pecan [31] (13,216 BAs), (v) BAs 
obtained from an S1S solver [23] (370 BAs), and (vi) BAs from LTL to SDBA trans- 
lation [49] (18 BAs). From these BAs, 23,850 are deterministic, 6,147 are SDBAs (but 
not deterministic), 4,105 are elevator (but not SDBAs), and 5,735 are the rest. 

In Table 1 we present an overview of the outcomes. Despite being a prototype,
Kofola can already complement a large portion of the input automata, with very few
cases that can be complemented successfully only by Spot or COLA. Regarding the
mean number of states, Kofola_S has the lowest mean among all tools (except
RANKER, which, however, had 1,000 unsolved cases). Moreover, Kofola significantly
decreased the mean number of states when included into the VBS: from 96 to 78.
We consider this to be a strong validation of the usefulness of our approach. Regarding
the runtime, both versions of Kofola are rather similar; Kofola is just slightly slower
than Spot and COLA but much faster than both RANKER and SEMINATOR (cf. [27]).

In Fig. 2 we present a comparison of the number of states generated by Kofola_S and
other tools; we omit VBS+ since the corresponding plot can be derived from the one for
VBS− (since RANKER and SEMINATOR only output BAs, we compare the sizes of outputs
transformed into BAs for all tools to be fair). In the plots, the number of benchmarks
represented by each mark is given by its colour; a mark above the diagonal means that
Kofola_S generated a BA smaller than the other tool while a mark on the top border
means that the other tool failed while Kofola_S succeeded, and symmetrically for the
bottom part and the right-hand border. Dashed lines represent the maximum number of
states generated by one of the tools in the plot; axes are logarithmic.

From the results, Kofola_S clearly dominates state-of-the-art tools that are not based
on SCC decomposition (RANKER, Spot, SEMINATOR). The outputs are quite comparable to
COLA, which also uses SCC decomposition and can be seen as an instantiation of our
framework. This supports our intuition that working on the single SCCs helps in reducing
the size of the final automaton, confirming the validity of our modular mix-and-match
Büchi complementation approach. Lastly, in the figure on the right we compare our
algorithm for elevator automata with the one in RANKER (the only other tool with
a dedicated algorithm for this subclass). Our new algorithm clearly dominates the one in RANKER.

[Inline figure: scatter plot of the numbers of states; horizontal axis: states of Kofola_S.]
8 Related Work


To the best of our knowledge, we provide the first general framework where one can
plug in different BA complementation algorithms while taking advantage of the specific
structure of SCCs. Below, we discuss the differences between our work and the literature.

The breakpoint construction [42] was designed to complement BAs with only IWCs,
while our construction treats it as a partial complementation procedure for IWCs and
differs in the need to handle incoming states from other partition blocks. The NCSB
family of algorithms [6,11,5,28] for SDBAs does not work when there are nondeterministic
jumps between DACs; they can, however, be adapted as partial procedures for
complementing DACs in our framework, cf. Sec. 4.2. In [29], a deelevation-based procedure
is applied to elevator automata to obtain BAs with a fixed maximum rank of 3, for
which a rank-based construction produces a result of the size in $O(16^n)$. In our work,
we exploit the structure of the SCCs much more to obtain an exponentially better upper
bound of $O(4^n)$ (the same as for SDBAs). The upper bound $O(4^n)$ for complementing
unambiguous BAs was established in [39], which is orthogonal to our work, but it seems
possible to incorporate it into our framework in the future.

There is a huge body of work on complementation of general BAs
[8,50,7,34,21,22,10,24,29,48,2,46,43,45,5,52,32,53,19,20]; all of them work on the
whole graph structure of the input BAs. Our framework is general enough to allow
including all of them as partial complementation procedures for NACs. On the other hand,
our framework does not directly allow (at least in the synchronous strategy) the use of
algorithms that do not work on the structure of the input BA, such as the learning-based
complementation algorithm from [38]. The recent determinization algorithm from [37],
which serves as our inspiration, also handles SCCs separately (it can actually be seen
as an instantiation of our framework). Our current algorithm is, however, more flexible:
it allows one to mix and match various constructions, to keep SCCs separate or merge
them into partition blocks, and to obtain the complexity $O(4^n)$, while [37] only
allowed $O(n!)$ (which is tight since SDBA determinization is in $\Omega(n!)$ [17,40]).

Regarding the tool Spot [15,16], it should not be perceived as a single complementation
algorithm. Instead, Spot should be seen as a highly engineered platform
utilizing breakpoint construction for inherently weak BAs, NCSB [6,11] for SDBAs,
and determinization-based complementation [46,43,45] for general BAs, while using
many other heuristics along the way. SEMINATOR uses semi-determinization [14,4,5] to
make sure the input is an SDBA and then uses NCSB [6,11] to compute the complement.


9 Conclusion and Future Work 


We have proposed a general framework for BA complementation where one can plug in
different partial complementation procedures for SCCs by taking advantage of their 
specific structure. Our framework not only obtains an exponentially better upper bound 
for elevator automata, but also complements existing approaches well. As shown by the 
experimental results (especially for the VBS), our framework significantly improves the 
current portfolio of complementation algorithms. 

We believe that our framework is an ideal testbed for experimenting with different 
BA complementation algorithms, e.g., for the following two reasons: (i) One can develop 
an efficient complementation algorithm that only works for a quite restricted sub-class of 
BAs (such as the algorithm for initial deterministic SCCs that we showed in Sec. 5.1) and 
the framework can leverage it for complementation of all BAs that contain such a sub- 
structure. (ii) When one tries to improve a general complementation algorithm, they can 
focus on complementation of the structurally hard SCCs (mainly the nondeterministic 
accepting SCCs) and do not need to look for heuristics that would improve the algorithm 
if there were some easier substructure present in the input BA (as was done, e.g., in [29]). 
From how the framework is defined, it immediately offers opportunities for being used 
for on-the-fly BA language inclusion testing, leveraging the partial complementation 
procedures present. Finally, we believe that the framework also enables new directions 
for future research by developing smart ways, probably based on machine learning, of 
selecting which partial complementation procedure should be used for which SCC, based 
on their features. In the future, we want to incorporate other algorithms for complementation
of NACs, and identify properties of SCCs that allow the use of more efficient algorithms
(such as unambiguous NACs [39]). Moreover, it seems that generalizing the DELAYED
optimization from [24] to the top-level algorithm could also help reduce the state space.

Abstract. We present a new streaming algorithm to validate JSON documents
against a set of constraints given as a JSON schema. Among the
possible values a JSON document can hold, objects are unordered col- 
lections of key-value pairs while arrays are ordered collections of values. 
We prove that there always exists a visibly pushdown automaton (VPA) 
that accepts the same set of JSON documents as a JSON schema. Lever- 
aging this result, our approach relies on learning a VPA for the provided 
schema. As the learned VPA assumes a fixed order on the key-value pairs 
of the objects, we abstract its transitions in a special kind of graph, and 
propose an efficient streaming algorithm using the VPA and its graph 
to decide whether a JSON document is valid for the schema. We evalu- 
ate the implementation of our algorithm on a number of random JSON 
documents, and compare it to the classical validation algorithm. 


Keywords: Visibly pushdown automata · JSON · streaming validation


1 Introduction 


JavaScript Object Notation (JSON) has overtaken XML as the de facto standard 
data-exchange format, in particular for web applications. JSON documents are 
easier to read for programmers and end users since they only have arrays and 
objects as structured types. Moreover, in contrast to XML, they do not include 
named open and end tags for all values, but open and end tags (braces actually) 
for arrays and objects only. JSON schema [13] is a simple schema language that 
allows users to impose constraints on the structure of JSON documents. 

In this work, we are interested in the validation of streaming JSON docu- 
ments against JSON schemas. Several previous results have been obtained about 
the formalization of XML schemas and the use of formal methods to validate 
XML documents (see, e.g., [5,15,16,18,24,25]). Recently, a standard to formalize
JSON schemas has been proposed and (hand-coded) validation tools for such
schemas can be found online [13]. Pezoa et al. [19] observe that the standard
of JSON documents is still evolving and that the formal semantics of JSON
schemas is also still changing. Furthermore, validation tools seem to make differ- 
ent assumptions about both documents and schemas. The authors of [19] carry 
out an initial formalization of JSON schemas into formal grammars from which 
they are able to construct a batch validation tool from a given JSON schema. 
In this paper, we rely on the formalization work of [19] and propose a stream- 
ing algorithm for validating JSON documents against JSON schemas. To our 
knowledge, this is the first JSON validation algorithm that is streaming. For 
XML, works that study streaming document validation base such algorithms
on the construction of some automaton (see, e.g., [25]). In [7], we
first experimented with one-counter automata for this purpose. We submit that
visibly pushdown automata (VPAs) are a better fit for this task; this is in
line with [15], where the same was proposed for streaming XML documents. In
contrast to one-counter automata,³ we show that VPAs are expressive enough
to capture the language of JSON documents satisfying any JSON schema. 
More importantly, we explain that active learning à la Angluin [4] is a good 
alternative to the automatic construction of such a VPA from the formal seman- 
tics of a given JSON schema. This is possible in the presence of labeled examples 
or a computer program that can answer membership and (approximate) equiv- 
alence queries about a set of JSON documents. This learning approach has two 
advantages. First, we derive from the learned VPA a streaming validator for 
JSON documents. Second, by automatically learning an automaton representa- 
tion, we circumvent the need to write a schema and subsequently validate that 
it represents the desired set of JSON documents. Indeed, it is well known that 
one of the highest bars that users have to clear to make use of formal methods is 
the effort required to write a formal specification, in this case, a JSON schema. 


Contributions. We present a VPA active learning framework to achieve what was 
mentioned above — though we fix an order on the keys appearing in objects. 
The latter assumption helps our algorithm learn faster. Secondly, we show how to 
bootstrap the learning algorithm by leveraging existing validation and document- 
generation tools to implement approximate equivalence checks. Thirdly, we de- 
scribe how to validate streaming documents using our fixed-order learned au- 
tomata — that is, our algorithm accepts other permutations of keys, not just 
the one encoded into the VPA. Finally, we present an empirical evaluation of 
our learning and validation algorithms, implemented on top of LEARNLIB [17]. 

All contributions, while complementary, are valuable in their own right. First, 
our learning algorithm for VPAs is a novel gray-box extension of TTT [9] that 
leverages side information about the language of all JSON documents. Second, 
our validation algorithm that uses a fixed-order VPA is novel and can be applied 
regardless of whether the automaton is learned or constructed from a schema. 
For the validation algorithm, we developed the concept of key graph, which allows
us to efficiently realize the validation no matter the key-value order in the document,
and might be of independent interest for other JSON-analysis applications
using VPAs. Finally, we implemented our own batch validator to facilitate
approximating equivalence queries as required by our learning algorithm. Both the
new validator and the equivalence oracle are efficient, open-source, and easy to
modify. We strongly believe the latter can be re-used in similar projects aiming
to learn automata representations of sets of JSON documents.

³ By nesting objects and arrays, we obtain a set of JSON documents encoding
$\{a^n b^m c^m d^n \mid n, m \in \mathbb{N}\}$, a context-free language that requires two counters.

A long version of this work is on arXiv: https://arxiv.org/abs/2211.08891. 


2 Visibly Pushdown Languages 


First, we recall the definition of VPAs [3] and state some of their properties. We
also recall how they can be actively learned following Angluin’s approach [4]. 


Visibly Pushdown Automata An alphabet $\Sigma$ is a finite set whose elements
are called symbols. A word $w$ over $\Sigma$ is a finite sequence of symbols from $\Sigma$,
with the empty word denoted by $\varepsilon$. The length of $w$ is denoted $|w|$; the set of
all words, $\Sigma^*$. Given two words $v, w \in \Sigma^*$, $v$ is a prefix (resp. suffix) of $w$ if
there exists $u \in \Sigma^*$ such that $w = vu$ (resp. $w = uv$), and $v$ is a factor of $w$ if
there exist $u, u' \in \Sigma^*$ such that $w = uvu'$. Given $L \subseteq \Sigma^*$, called a language, we
denote by $\mathrm{Pref}(L)$ (resp. $\mathrm{Suff}(L)$) the set of prefixes (resp. suffixes) of words of
$L$. Given a set $Q$, we write $I_Q$ for the identity relation $\{(q, q) \mid q \in Q\}$ on $Q$.
VPAs [3] are particular pushdown automata that we recall in this section.
The pushdown alphabet, denoted $\tilde{\Sigma} = (\Sigma_c, \Sigma_r, \Sigma_i)$, is partitioned into pairwise
disjoint alphabets $\Sigma_c, \Sigma_r, \Sigma_i$ such that $\Sigma_c$ (resp. $\Sigma_r$, $\Sigma_i$) is the set of call symbols
(resp. return symbols, internal symbols). In this paper, we work with the
particular alphabet of return symbols $\Sigma_r = \{\bar{a} \mid a \in \Sigma_c\}$. For any such $\tilde{\Sigma}$, we
denote by $\Sigma$ the alphabet $\Sigma_c \cup \Sigma_r \cup \Sigma_i$. Given a pushdown alphabet $\tilde{\Sigma}$, the set
$\mathrm{WM}(\tilde{\Sigma})$ of well-matched words over $\tilde{\Sigma}$ is defined as follows:

— $\varepsilon \in \mathrm{WM}(\tilde{\Sigma})$, and $a \in \mathrm{WM}(\tilde{\Sigma})$ for all $a \in \Sigma_i$,
— if $w, w' \in \mathrm{WM}(\tilde{\Sigma})$, then $ww' \in \mathrm{WM}(\tilde{\Sigma})$,
— if $a \in \Sigma_c$ and $w \in \mathrm{WM}(\tilde{\Sigma})$, then $a w \bar{a} \in \mathrm{WM}(\tilde{\Sigma})$.


Also, the call/return balance function $\beta : \Sigma^* \to \mathbb{Z}$ is defined as $\beta(\varepsilon) = 0$ and
$\beta(ua) = \beta(u) + x$ with $x$ being $1$, $-1$, or $0$ if $a$ is in $\Sigma_c$, $\Sigma_r$, or $\Sigma_i$ respectively.
In particular, for all $w \in \mathrm{WM}(\tilde{\Sigma})$, we have $\beta(u) \geq 0$ for each prefix $u$ of $w$
and $\beta(u) \leq 0$ for each suffix $u$ of $w$. Finally, the depth $d(w)$ of a well-matched
word $w$ is equal to $\max\{\beta(u) \mid u \in \mathrm{Pref}(\{w\})\}$, that is, the maximum number
of unmatched call symbols among the prefixes of $w$.
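The balance function, well-matchedness, and depth can be illustrated with a minimal Python sketch. The encoding is an assumption made for illustration only: a word is a list of symbols, and the hypothetical dictionary `MATCH` maps each call symbol to its return symbol.

```python
def balance(word, match):
    """Call/return balance: +1 per call symbol, -1 per return symbol, 0 otherwise."""
    calls, returns = set(match), set(match.values())
    return sum(1 if a in calls else -1 if a in returns else 0 for a in word)


def is_well_matched(word, match):
    """Check membership in WM: every return closes the most recent open call."""
    returns = set(match.values())
    pending = []  # return symbols expected for the currently unmatched calls
    for a in word:
        if a in match:
            pending.append(match[a])
        elif a in returns:
            if not pending or pending.pop() != a:
                return False
    return not pending


def depth(word, match):
    """Maximum balance over all prefixes of a well-matched word."""
    return max(balance(word[:i], match) for i in range(len(word) + 1))


# Hypothetical encoding: call symbol -> matching return symbol.
MATCH = {"{": "}", "[": "]"}
word = ["{", "k", "[", "s", "#", "s", "]", "}"]
print(balance(word, MATCH), is_well_matched(word, MATCH), depth(word, MATCH))
```

The `depth` function recomputes the balance for every prefix, which is quadratic; a single pass with a running counter would do the same job and is what a streaming setting would use.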


Definition 1. A visibly pushdown automaton (VPA) over a pushdown alphabet
$\tilde{\Sigma}$ is a tuple $\mathcal{A} = (Q, \tilde{\Sigma}, \Gamma, \delta, Q_I, Q_F)$ where $Q$ is a finite non-empty set of states,
$Q_I \subseteq Q$ is a set of initial states, $Q_F \subseteq Q$ is a set of final states, $\Gamma$ is a stack
alphabet, and $\delta$ is a finite set of transitions of the form $\delta = \delta_c \cup \delta_r \cup \delta_i$ where
$\delta_c \subseteq Q \times \Sigma_c \times Q \times \Gamma$ is the set of call transitions, $\delta_r \subseteq Q \times \Sigma_r \times \Gamma \times Q$ is the
set of return transitions, and $\delta_i \subseteq Q \times \Sigma_i \times Q$ is the set of internal transitions.
The size of $\mathcal{A}$ is denoted by $|Q|$, and its number of transitions by $|\delta|$.

Let us describe the transition system $T_{\mathcal{A}}$ of a VPA $\mathcal{A}$ whose vertices are configurations.
A configuration is a pair $\langle q, \sigma \rangle$ where $q \in Q$ is a state and $\sigma \in \Gamma^*$ a
stack content. A configuration is initial (resp. final) if $q \in Q_I$ (resp. $q \in Q_F$) and
$\sigma = \varepsilon$. For $a \in \Sigma$, we write $\langle q, \sigma \rangle \xrightarrow{a} \langle q', \sigma' \rangle$ in $T_{\mathcal{A}}$ if there is either a call transition
$(q, a, q', \gamma) \in \delta_c$ verifying $\sigma' = \gamma\sigma$,⁴ or a return transition $(q, a, \gamma, q') \in \delta_r$
verifying $\sigma = \gamma\sigma'$, or an internal transition $(q, a, q') \in \delta_i$ such that $\sigma' = \sigma$.

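The transition relation of $T_{\mathcal{A}}$ is extended to words in the usual way. We say
that $\mathcal{A}$ accepts a word $w \in \Sigma^*$ if there exists a path in $T_{\mathcal{A}}$ from an initial configuration
to a final configuration that is labeled by $w$. The language of $\mathcal{A}$, denoted
by $\mathcal{L}(\mathcal{A})$, is defined as $\mathcal{L}(\mathcal{A}) = \{w \in \Sigma^* \mid \exists q \in Q_I, \exists q' \in Q_F, \langle q, \varepsilon \rangle \xrightarrow{w} \langle q', \varepsilon \rangle\}$,
i.e., the set of all words accepted by $\mathcal{A}$. Any language accepted by some VPA
is a visibly pushdown language (VPL). Notice that such a language is composed
of well-matched words only.⁵ Given a VPA $\mathcal{A}$ over $\tilde{\Sigma}$, the reachability relation
$\mathrm{Reach}_{\mathcal{A}}$ of $\mathcal{A}$ is $\mathrm{Reach}_{\mathcal{A}} = \{(q, q') \in Q^2 \mid \exists w \in \mathrm{WM}(\tilde{\Sigma}), \langle q, \varepsilon \rangle \xrightarrow{w} \langle q', \varepsilon \rangle\}$.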

Finally, we say that $p \in Q$ is a bin state if there exists no path in $T_{\mathcal{A}}$ of the
form $\langle q, \varepsilon \rangle \xrightarrow{w} \langle p, \sigma \rangle \xrightarrow{w'} \langle q', \varepsilon \rangle$ with $q \in Q_I$ and $q' \in Q_F$. If a VPA $\mathcal{A}$ has bin
states, those states can be removed from $Q$ as well as the transitions containing
bin states without modifying the accepted language.
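The acceptance condition above can be made concrete with a small Python sketch that explores all configurations of a possibly nondeterministic VPA. The dictionary layout and the toy automaton `TOY` are assumptions made for illustration; they are not taken from the paper.

```python
def accepts(vpa, word):
    """Explore all runs of a (possibly nondeterministic) VPA on `word`.

    Transitions follow Definition 1: call (q, a, q', gamma),
    return (q, a, gamma, q'), internal (q, a, q').
    """
    # A configuration is (state, stack); the stack is a tuple with its top first.
    configs = {(q, ()) for q in vpa["initial"]}
    for a in word:
        successors = set()
        for state, stack in configs:
            for (q, b, q2, gamma) in vpa["call"]:
                if (q, b) == (state, a):
                    successors.add((q2, (gamma,) + stack))
            for (q, b, gamma, q2) in vpa["return"]:
                if (q, b) == (state, a) and stack[:1] == (gamma,):
                    successors.add((q2, stack[1:]))
            for (q, b, q2) in vpa["internal"]:
                if (q, b) == (state, a):
                    successors.add((q2, stack))
        configs = successors
    # Accept when some run ends in a final state with an empty stack.
    return any(q in vpa["final"] and stack == () for q, stack in configs)


# Hypothetical toy VPA: n calls "[", one "s", then n returns "]" (n >= 0).
TOY = {
    "initial": {0},
    "final": {1},
    "call": {(0, "[", 0, "g")},
    "return": {(1, "]", "g", 1)},
    "internal": {(0, "s", 1)},
}
print(accepts(TOY, "[[s]]"), accepts(TOY, "[s"))
```

Because a final configuration requires an empty stack, a word with an unmatched call such as `"[s"` is rejected even though the run ends in a final state.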


Minimal Deterministic VPAs Given a VPA $\mathcal{A} = (Q, \tilde{\Sigma}, \Gamma, \delta, Q_I, Q_F)$, we say
that it is deterministic (det-VPA) if $|Q_I| = 1$ and $\mathcal{A}$ does not have two distinct
transitions with the same left-hand side. By left-hand side, we mean $(q, a)$ for a
call transition $(q, a, q', \gamma) \in \delta_c$ or an internal transition $(q, a, q') \in \delta_i$, and $(q, a, \gamma)$
for a return transition $(q, a, \gamma, q') \in \delta_r$.


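Theorem 1 ([3,32]). For any VPA $\mathcal{A}$ over $\tilde{\Sigma}$, one can construct a det-VPA
$\mathcal{B}$ over $\tilde{\Sigma}$ such that $\mathcal{L}(\mathcal{A}) = \mathcal{L}(\mathcal{B})$. Moreover, the size of $\mathcal{B}$ is in $O(2^{|Q|^2})$ and the
size of its stack alphabet is in $O(|\Sigma_c| \cdot 2^{|Q|^2})$.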


Proof. Let us briefly recall this construction. Let $\mathcal{A} = (Q, \tilde{\Sigma}, \Gamma, \delta, Q_I, Q_F)$. The
states of $\mathcal{B}$ are subsets $R$ of the reachability relation $\mathrm{Reach}_{\mathcal{A}}$ of $\mathcal{A}$ and the
stack symbols of $\mathcal{B}$ are of the form $(R, a)$ with $R \subseteq \mathrm{Reach}_{\mathcal{A}}$ and $a \in \Sigma_c$. Let
$w = u_1 a_1 u_2 a_2 \ldots u_n a_n u_{n+1}$ be such that $n \geq 0$ and $u_i \in \mathrm{WM}(\tilde{\Sigma})$, $a_i \in \Sigma_c$
for all $i$. That is, we decompose $w$ in terms of its unmatched call symbols.
Let $R_i$ be equal to $\{(p, q) \mid \langle p, \varepsilon \rangle \xrightarrow{u_i} \langle q, \varepsilon \rangle\}$ for all $i$. Then after reading $w$,
the det-VPA $\mathcal{B}$ has its current state equal to $R_{n+1}$ and its stack containing
$(R_n, a_n) \ldots (R_2, a_2)(R_1, a_1)$. Assume we are reading the symbol $a$ after $w$, then
$\mathcal{B}$ performs the following transition from $R_{n+1}$: (1) if $a \in \Sigma_c$, then push $(R_{n+1}, a)$
on the stack and go to the state $R = I_Q$ (a new unmatched call symbol is read);
(2) if $a \in \Sigma_i$, then go to the state $R = \{(p, q) \mid \exists (p, p') \in R_{n+1}, (p', a, q) \in \delta_i\}$
($u_{n+1}$ is extended to the well-matched word $u_{n+1} a$); (3) if $a \in \Sigma_r$, then pop
$(R_n, a_n)$ from the stack if $\bar{a}_n = a$, and go to the state

$R = \{(p, q) \mid \exists (p, p') \in R_n, (p', a_n, r', \gamma) \in \delta_c, (r', r) \in R_{n+1}, (r, a, \gamma, q) \in \delta_r\}$

(the call symbol $a_n$ is matched with the return symbol $a = \bar{a}_n$, leading to the
well-matched word $u_n a_n u_{n+1} a$). Finally the initial state of $\mathcal{B}$ is $I_Q$, and its final
states are sets $R$ containing some $(p, q)$ with $p \in Q_I$ and $q \in Q_F$.

⁴ The stack symbol $\gamma$ is pushed on the left of $\sigma$.
⁵ The original definition of VPA [3] allows acceptance of ill-matched words.
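The construction recalled in the proof of Theorem 1 can be run on the fly, without building $\mathcal{B}$ explicitly. The Python sketch below tracks the current set $R$ and the stack of pairs $(R, a)$ while reading a word. The dictionary layout, the `match` mapping from call symbols to return symbols, and the toy automaton are assumptions made for illustration.

```python
def det_accepts(vpa, match, word):
    """Simulate the determinized VPA of Theorem 1 on `word`, on the fly."""
    identity = frozenset((q, q) for q in vpa["states"])
    returns = set(match.values())
    current, stack = identity, []
    for a in word:
        if a in match:  # (1) call symbol: push (R, a) and restart from I_Q
            stack.append((current, a))
            current = identity
        elif a in returns:  # (3) return symbol: pop and combine both levels
            if not stack or match[stack[-1][1]] != a:
                return False
            below, opened = stack.pop()
            inside = current
            current = frozenset(
                (p, q)
                for (p, p1) in below
                for (c, b, r1, gamma) in vpa["call"] if (c, b) == (p1, opened)
                for (r0, r) in inside if r0 == r1
                for (s, e, g, q) in vpa["return"] if (s, e, g) == (r, a, gamma)
            )
        else:  # (2) internal symbol: extend every pair by one transition
            current = frozenset(
                (p, q)
                for (p, p1) in current
                for (s, b, q) in vpa["internal"] if (s, b) == (p1, a)
            )
    return not stack and any(
        p in vpa["initial"] and q in vpa["final"] for (p, q) in current
    )


# Hypothetical nondeterministic toy VPA: on "s", state 0 may stay or move to 1.
TOY = {
    "states": {0, 1},
    "initial": {0},
    "final": {1},
    "call": {(0, "[", 0, "g")},
    "return": {(1, "]", "g", 1)},
    "internal": {(0, "s", 0), (0, "s", 1)},
}
print(det_accepts(TOY, {"[": "]"}, "[ss]"), det_accepts(TOY, {"[": "]"}, "[s"))
```

Every set computed this way only contains pairs connected by a well-matched word, so it is a subset of $\mathrm{Reach}_{\mathcal{A}}$, as required of the states of $\mathcal{B}$.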


Though a VPL $L$ in general does not have a unique minimal det-VPA $\mathcal{A}$
accepting $L$, restricting to the following subclass leads to a unique minimal acceptor.


Definition 2 ([2,9]). A 1-module single entry VPA⁶ (1-SEVPA) is a det-VPA
$\mathcal{A} = (Q, \tilde{\Sigma}, \Gamma, \delta, Q_I = \{q_0\}, Q_F)$ such that its stack alphabet $\Gamma$ is equal to
$Q \times \Sigma_c$, and all its call transitions $(q, a, q', \gamma) \in \delta_c$ are such that $q' = q_0$ and
$\gamma = (q, a)$.
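As a small illustration of Definition 2, the following Python sketch checks determinism and the 1-SEVPA shape of call transitions. The dictionary layout and the example automata are assumptions made for illustration.

```python
def is_deterministic(vpa):
    """One initial state and no two distinct transitions with the same left-hand side."""
    lhs = (
        [(q, a) for (q, a, _, _) in vpa["call"]]
        + [(q, a) for (q, a, _) in vpa["internal"]]
        + [(q, a, gamma) for (q, a, gamma, _) in vpa["return"]]
    )
    return len(vpa["initial"]) == 1 and len(lhs) == len(set(lhs))


def is_1sevpa(vpa):
    """Every call (q, a, q', gamma) must go to q0 and push gamma = (q, a)."""
    if not is_deterministic(vpa):
        return False
    (q0,) = vpa["initial"]
    return all(
        target == q0 and gamma == (q, a) for (q, a, target, gamma) in vpa["call"]
    )


# Hypothetical examples: GOOD respects the call-transition shape, BAD does not.
GOOD = {
    "initial": {0},
    "final": {1},
    "call": {(0, "[", 0, (0, "[")), (1, "[", 0, (1, "["))},
    "return": {(1, "]", (0, "["), 1)},
    "internal": {(0, "s", 1)},
}
BAD = dict(GOOD, call={(0, "[", 1, (0, "["))})
print(is_1sevpa(GOOD), is_1sevpa(BAD))
```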


Theorem 2 ([2]). For any VPL $L$, there exists a unique minimal (with regards
to the number of states) 1-SEVPA accepting $L$, up to a renaming of the states.⁷


Learning VPAs Let us recall the concept of learning a deterministic finite
automaton (DFA), as introduced in [4]. Let $L$ be a regular language over an
alphabet $\Sigma$. The task of the learner is to construct a DFA $\mathcal{H}$ such that $\mathcal{L}(\mathcal{H}) = L$
by interacting with the teacher. The two possible types of interactions are
membership queries (does $w \in \Sigma^*$ belong to $L$?), and equivalence queries (does
the DFA $\mathcal{H}$ accept $L$?). For the latter type, if the answer is negative, the teacher
also provides a counterexample, i.e., a word $w$ such that $w \in L \iff w \notin \mathcal{L}(\mathcal{H})$. The
so-called $L^*$ algorithm of [4] learns at least one representative per equivalence
class of the Myhill-Nerode congruence of $L$ [8] from which the minimal DFA
$\mathcal{D}$ accepting $L$ is constructed. This learning process terminates and it uses a
polynomial number of membership and equivalence queries in the size of $\mathcal{D}$, and
in the length of the longest counterexample returned by the teacher [4].

In [9], an extension of Angluin's learning algorithm is given for VPLs. The
Myhill-Nerode congruence for regular languages is extended to VPLs as follows.
Given a pushdown alphabet $\tilde{\Sigma}$ and a VPL $L$ over $\tilde{\Sigma}$, we consider the set of context
pairs $\mathrm{CP}(\tilde{\Sigma}) = \{(u, v) \in (\mathrm{WM}(\tilde{\Sigma}) \cdot \Sigma_c)^* \times \mathrm{Suff}(\mathrm{WM}(\tilde{\Sigma})) \mid \beta(u) = -\beta(v)\}$,
and we define the equivalence relation $\sim_L \subseteq \mathrm{WM}(\tilde{\Sigma}) \times \mathrm{WM}(\tilde{\Sigma})$ [2,9] such that
$w \sim_L w'$ if and only if $\forall (u, v) \in \mathrm{CP}(\tilde{\Sigma}), uwv \in L \iff uw'v \in L$. The minimal
1-SEVPA accepting $L$ as described in Theorem 2 is constructed from $\sim_L$ such
that its states are the equivalence classes of $\sim_L$.


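Theorem 3 ([9]). Let $L$ be a VPL over $\tilde{\Sigma}$ and $n$ be the index of $\sim_L$. One can
learn the minimal 1-SEVPA accepting $L$ with at most $n$ equivalence queries
and a number of membership queries polynomial in $n$, $|\Sigma|$, and $\log \ell$, where $\ell$ is
the length of the longest counterexample returned by the teacher.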


The learning process designed in [9] extends to VPLs the TTT algorithm pro- 
posed in [10] for regular languages. TTT improves the efficiency of the L* algo- 
rithm by eliminating redundancies in counterexamples provided by the teacher. 


⁶ The definitions of 1-SEVPA in [2] and [9] differ slightly. We follow the one in [9].
⁷ This 1-SEVPA may be exponentially bigger than the size of a VPA accepting $L$.

3 JSON Format


In this section, we describe JSON documents [6] and JSON schemas [13] that 
impose some constraints on the structure of JSON documents. We also present 
the abstractions we make for the purpose of this paper. 


JSON Documents We describe the structure of JSON documents. Our presen- 
tation is inspired by [19], though some details are skipped for readability (see [14] 
for a full description). The JSON format defines different types of JSON values: 


— true, false, null are JSON values. Any decimal number (positive, negative)
is a JSON value, called a number. In particular any number that is an integer
is called an integer. Any finite sequence of characters starting and ending
with " is a string value. All those values are called primitive values.

— If $v_1, v_2, \ldots, v_n$ are JSON values and $k_1, k_2, \ldots, k_n$ are pairwise distinct
string values, then $\{k_1 : v_1, k_2 : v_2, \ldots, k_n : v_n\}$ is a JSON value, called an
object. Each $k_i : v_i$ is called a key-value pair such that $k_i$ is the key. The
collection of these pairs is unordered.

— If $v_1, v_2, \ldots, v_n$ are JSON values, then $[v_1, v_2, \ldots, v_n]$ is a JSON value, called
an array. Each $v_i$ is an element and the collection thereof is ordered.


In this work, JSON documents are supposed to be objects.⁸ One can use JSON
pointers to navigate through a document, e.g., if $J$ is an object and $k$ is a key,
then $J[k]$ is the value $v$ such that the key-value pair $k : v$ appears in $J$.

In this paper, we consider somewhat abstract JSON documents. We see JSON
documents as well-matched words over the pushdown alphabet $\tilde{\Sigma}_{\mathrm{JSON}}$ that we
describe hereafter. We abstract all string values as $\mathtt{s}$, and all numbers as $\mathtt{n}$ (as
$\mathtt{i}$ when they are integers). We denote by $\Sigma_{\mathrm{pVal}} = \{\mathtt{true}, \mathtt{false}, \mathtt{null}, \mathtt{s}, \mathtt{n}, \mathtt{i}\}$
the alphabet composed of the six primitive values. Concerning the key-value
pairs appearing in objects, each key together with the symbol ":" following the
key is abstracted as an alphabet symbol $k$. We assume knowledge of a finite
alphabet $\Sigma_{\mathrm{key}}$ of keys. We define the pushdown alphabet $\tilde{\Sigma}_{\mathrm{JSON}} = (\Sigma_c, \Sigma_r, \Sigma_i)$
with $\Sigma_i = \Sigma_{\mathrm{key}} \cup \Sigma_{\mathrm{pVal}} \cup \{\#\}$, where $\#$ is used in place of the comma; $\Sigma_c = \{\prec, \sqsubset\}$,
where $\prec$ (resp. $\sqsubset$) is used in place of "{" (resp. "["); and $\Sigma_r = \{\succ, \sqsupset\}$, with
$\bar{\prec} = \succ$ and $\bar{\sqsubset} = \sqsupset$. We denote by $\Sigma_{\mathrm{JSON}}$ the set $\Sigma_c \cup \Sigma_r \cup \Sigma_i$.


Example 1. An example of a JSON document is given in Listing 1. We can see
that this document is an object containing three keys: "title", whose associated
value is a string value; "keywords", whose value is an array containing string
values; and "conf", whose value is an object. This inner object contains two keys:
"name", whose value is a string value; "year", whose value is an integer. The
pointer J[conf][name], where J is the root of the document, retrieves the value
"TACAS". The JSON document is abstracted as the word
$\prec k_1 \mathtt{s} \,\#\, k_2 \sqsubset \mathtt{s} \,\#\, \mathtt{s} \,\#\, \mathtt{s} \sqsupset \#\, k_3 \prec k_4 \mathtt{s} \,\#\, k_5 \mathtt{i} \succ \succ \; \in \mathrm{WM}(\tilde{\Sigma}_{\mathrm{JSON}})$, where $\Sigma_{\mathrm{key}}$ contains the keys $k_i$, $i \in \{1, \ldots, 5\}$.
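The abstraction of Example 1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the plain characters `{`, `}`, `[`, `]` stand for the call and return symbols, each key is emitted together with its colon (for instance `title:`), Python integers map to `i` and floats to `n`, and the document is parsed with the standard `json` module, which keeps keys in document order.

```python
import json


def abstract(value):
    """Return the abstracted word (a list of symbols) of a parsed JSON value."""
    if isinstance(value, bool):  # must be tested before int: bool is a subclass of int
        return ["true" if value else "false"]
    if value is None:
        return ["null"]
    if isinstance(value, str):
        return ["s"]
    if isinstance(value, int):
        return ["i"]
    if isinstance(value, float):
        return ["n"]
    if isinstance(value, list):
        word = ["["]
        for index, element in enumerate(value):
            word += (["#"] if index else []) + abstract(element)
        return word + ["]"]
    if isinstance(value, dict):
        word = ["{"]
        for index, (key, element) in enumerate(value.items()):
            word += (["#"] if index else []) + [key + ":"] + abstract(element)
        return word + ["}"]
    raise TypeError(f"not a JSON value: {value!r}")


document = json.loads("""
{ "title": "Validating Streaming JSON Documents with Learned VPAs",
  "keywords": ["VPA", "JSON documents", "streaming validation"],
  "conf": { "name": "TACAS", "year": 2023 } }
""")
print(" ".join(abstract(document)))
```

Note that `json.loads` silently keeps only the last value of a duplicated key, so a streaming validator that must reject duplicated keys cannot rely on this parsing step.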


⁸ In [6], a JSON document can be any JSON value and duplicated keys are allowed
inside objects. In this paper, we follow what is commonly used in practice: JSON
documents are objects, and keys are pairwise distinct inside objects.

    { "title": "Validating Streaming JSON Documents with Learned VPAs",
      "keywords": ["VPA", "JSON documents", "streaming validation"],
      "conf": { "name": "TACAS", "year": 2023 }
    }

Listing 1: A JSON document.

    { "type": "object",
      "required": ["title", "conf"],
      "properties": {
        "title": { "type": "string" },
        "keywords": { "type": "array", "items": { "type": "string" } },
        "conf": {
          "type": "object",
          "required": ["name", "year"],
          "properties": { "name": {"type": "string"}, "year": {"type": "integer"} } } } }

Listing 2: A JSON schema.


JSON Schemas A JSON schema can impose some constraints on JSON doc- 
uments by specifying any of the types of JSON values that appear in those 
documents. We say that a JSON document satisfies (or is valid for) the schema 
if it verifies the constraints imposed by this schema. We denote by $\mathcal{L}(S)$ the set
of documents that are valid for S. In this section, we give a simplified presen- 
tation of JSON schemas and refer to [13] for a complete description and to [19] 
for a formalization (i.e. a formal grammar with its syntax and semantics). 

A JSON schema is itself a JSON document that uses several keywords that 
help shape and restrict the set of JSON documents that this schema specifies. As 
we abstract JSON documents, JSON schemas we work on are also abstracted. 
We do not consider the restrictions that can be imposed on string values and 
numbers, for instance. We give here a few examples. See [13] for more details. 


— Within object schemas, restrictions can be imposed on the key-value pairs 
of the objects, e.g., the value associated with some key has itself to satisfy a 
certain schema, or some particular keys must be present in the object. 

— Within array schemas, it can be imposed that all elements of the array satisfy 
a certain schema, or that the array has a minimum/maximum size. 

— Schemas can be combined with Boolean operations, e.g., a JSON document 
must satisfy a conjunction of several schemas. 

— A schema can be defined as one referred to by a JSON pointer. This allows 
a recursive structure for the JSON documents satisfying a certain schema. 


Example 2. The schema from Listing 2 describes objects that can have three 
keys: "title", whose associated value must be a string value; "keywords", an 
array of strings; and "conf", an object. Among these, "title" and "conf" are 
required. The JSON document of Example 1 satisfies this JSON schema. 
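To make the notion of validity concrete, here is a minimal recursive (batch) validator in Python for the few keywords used in Listing 2 (`type`, `required`, `properties`, `items`). It is a sketch for illustration, not the classical validator evaluated in the paper, and it assumes that keys not listed under `properties` are unconstrained, which is the default behaviour of JSON Schema.

```python
def is_valid(doc, schema):
    """Check a parsed JSON value against a schema using only
    the keywords "type", "required", "properties" and "items"."""
    kind = schema.get("type")
    if kind is None:
        return True  # no "type" keyword: no constraint in this sketch
    if kind == "string":
        return isinstance(doc, str)
    if kind == "integer":
        return isinstance(doc, int) and not isinstance(doc, bool)
    if kind == "number":
        return isinstance(doc, (int, float)) and not isinstance(doc, bool)
    if kind == "boolean":
        return isinstance(doc, bool)
    if kind == "null":
        return doc is None
    if kind == "array":
        items = schema.get("items", {})
        return isinstance(doc, list) and all(is_valid(v, items) for v in doc)
    if kind == "object":
        if not isinstance(doc, dict):
            return False
        if any(key not in doc for key in schema.get("required", [])):
            return False
        properties = schema.get("properties", {})
        return all(is_valid(v, properties[k]) for k, v in doc.items() if k in properties)
    raise ValueError(f"unsupported type: {kind!r}")


SCHEMA = {  # the schema of Listing 2
    "type": "object",
    "required": ["title", "conf"],
    "properties": {
        "title": {"type": "string"},
        "keywords": {"type": "array", "items": {"type": "string"}},
        "conf": {
            "type": "object",
            "required": ["name", "year"],
            "properties": {"name": {"type": "string"}, "year": {"type": "integer"}},
        },
    },
}
DOCUMENT = {  # the document of Listing 1
    "title": "Validating Streaming JSON Documents with Learned VPAs",
    "keywords": ["VPA", "JSON documents", "streaming validation"],
    "conf": {"name": "TACAS", "year": 2023},
}
print(is_valid(DOCUMENT, SCHEMA))
```

This validator needs the whole parsed document in memory, which is exactly what a streaming algorithm avoids.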


Under these abstractions, we can always construct a VPA that accepts the
same set of JSON documents as a schema $S$, as shown in the following theorem.
We also extend this construction to the case where we fix an order $<$
on $\Sigma_{\mathrm{key}}$ and consider the set $\mathcal{L}_<(S)$ of documents valid for $S$ whose key order
inside objects respects this order $<$. The main idea of the proof is to define a
formalism of JSON schemas as extended context-free grammars, and show that
we can construct a VPA from such a grammar.


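Theorem 4. Let $S$ be a JSON schema. Then, there exists a VPA $\mathcal{A}$ such that
$\mathcal{L}(\mathcal{A})$ is the set $\mathcal{L}(S)$ of documents valid with regards to $S$. Moreover, for any
order $<$ of $\Sigma_{\mathrm{key}}$, there exists a VPA $\mathcal{B}$ such that $\mathcal{L}(\mathcal{B}) = \mathcal{L}_<(S)$.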


Our proof does not give a construction of the grammar from the schema $S$.
The grammar depends on the formal semantics of JSON schemas which are
still changing and being debated. Thus, to be more robust to changes in the
semantics, we prefer to learn the minimal 1-SEVPA $\mathcal{B}$ accepting $\mathcal{L}_<(S)$ given
a fixed order $<$, in the sense of Theorem 3.⁹ For learning, equivalence queries
require generating a certain number of random JSON documents.¹⁰ If $S$ and
the learner's hypothesis $\mathcal{H}$ disagree on a document, we have a counterexample.
Otherwise, we say that $\mathcal{H}$ is correct. In both membership and equivalence queries,
we only accept documents whose key order inside objects satisfies the order $<$. The
randomness used in the equivalence queries implies that the learned 1-SEVPA
may not exactly accept $\mathcal{L}_<(S)$. Generating a large number of documents
reduces the probability that an incorrect 1-SEVPA is learned.
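The random-sampling equivalence check described above can be sketched in Python as follows. The function name, its parameters, and the toy languages are hypothetical; the sketch only illustrates the idea of searching for a disagreement between the hypothesis and the target on randomly generated inputs.

```python
import random


def approximate_equivalence(hypothesis, target, generate, samples=1000, seed=0):
    """Return a counterexample on which `hypothesis` and `target` disagree,
    or None if no disagreement is found among `samples` random inputs."""
    rng = random.Random(seed)
    for _ in range(samples):
        candidate = generate(rng)
        if hypothesis(candidate) != target(candidate):
            return candidate
    return None  # the hypothesis is declared correct, possibly wrongly


# Hypothetical toy setting: words are lists of "s"; the target language
# contains the words of even length, the hypothesis accepts every word.
def random_word(rng):
    return ["s"] * rng.randint(0, 5)


def even_length(word):
    return len(word) % 2 == 0


counterexample = approximate_equivalence(lambda word: True, even_length, random_word)
print(counterexample is not None)
```

A `None` answer only means that no disagreement was sampled; raising `samples` lowers the chance of missing one, at the cost of more validation work per equivalence query.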


4 Validation of JSON Documents 


For this section, let us fix a schema $S$, an order $<$ on $\Sigma_{\mathrm{key}}$, and a 1-SEVPA $\mathcal{A} =
(Q, \tilde{\Sigma}_{\mathrm{JSON}}, \Gamma, \delta, \{q_0\}, Q_F)$ accepting $\mathcal{L}_<(S)$. We present a streaming algorithm to
decide if a document $J$ is in $\mathcal{L}(S)$. By "streaming", we mean an algorithm that
processes the document in a single pass, symbol by symbol. Our new approach is
as follows. We learn $\mathcal{A}$ such that $\mathcal{L}(\mathcal{A}) = \mathcal{L}_<(S)$. As $\mathcal{L}_<(S) \neq \mathcal{L}(S)$, we design an
algorithm that uses $\mathcal{A}$ to validate documents with arbitrary key orders.
To do this, we use a key graph, defined in the sequel.


Key Graph In this section, w.l.o.g. we suppose that $\mathcal{A}$ has no bin states. Let
$T_{\mathcal{A}}$ be the transition system of $\mathcal{A}$. We explain how to associate to $\mathcal{A}$ its key graph
$G_{\mathcal{A}}$: an abstraction of the paths of $T_{\mathcal{A}}$ labeled by the contents of the objects
appearing in words of $\mathcal{L}_<(S)$. This graph is essential in our validation algorithm.


Definition 3. The key graph $G_{\mathcal{A}}$ of $\mathcal{A}$ has:
— the vertices $(p, k, p')$ with $p, p' \in Q$ and $k \in \Sigma_{\mathrm{key}}$ if there exists in $T_{\mathcal{A}}$ a path
$\langle p, \varepsilon \rangle \xrightarrow{kv} \langle p', \varepsilon \rangle$ with $v \in \Sigma_{\mathrm{pVal}} \cup \{a u \bar{a} \mid a \in \Sigma_c, u \in \mathrm{WM}(\tilde{\Sigma}_{\mathrm{JSON}})\}$,¹¹
— the edges $((p_1, k_1, p_1'), (p_2, k_2, p_2'))$ if there exists $(p_1', \#, p_2) \in \delta_i$.

⁹ We use this automaton in the next section for the validation of JSON documents.
We do not use a 1-SEVPA for $\mathcal{L}(S)$ as it could be exponentially larger.
¹⁰ It is common to proceed this way in automata learning, as explained in [4, Sec. 4].
¹¹ Notice that each vertex $(p, k, p')$ of $G_{\mathcal{A}}$ only stores the key $k$ and not the word $kv$.

Fig. 1: A 1-SEVPA for the schema from Listing 2, without the key keywords.
Fig. 2: The key graph for the 1-SEVPA from Figure 1.

We have the following property.


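Lemma 1. There exists a path $((p_1, k_1, p_1') \ldots (p_n, k_n, p_n'))$ in $G_{\mathcal{A}}$ with $p_1 = q_0$
if and only if there exist a factor $u$ of a word in $\mathcal{L}_<(S)$ such that $u = k_1 v_1 \# \ldots \# k_n v_n$
where each $k_i v_i$ is a key-value pair, and a path $\langle q_0, \varepsilon \rangle \xrightarrow{u} \langle p_n', \varepsilon \rangle$ in $T_{\mathcal{A}}$ that
decomposes as $\langle p_i, \varepsilon \rangle \xrightarrow{k_i v_i} \langle p_i', \varepsilon \rangle$, $\forall i \in \{1, \ldots, n\}$ and $\langle p_i', \varepsilon \rangle \xrightarrow{\#} \langle p_{i+1}, \varepsilon \rangle$, $\forall i \in
\{1, \ldots, n-1\}$. Furthermore, there is no path $((p_1, k_1, p_1') \ldots (p_n, k_n, p_n'))$ such
that $k_i = k_j$ for some $i \neq j$. That is, $G_{\mathcal{A}}$ contains a finite number of paths.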


Hence, paths in $G_{\mathcal{A}}$ focus on contents of objects being part of JSON documents
in $\mathcal{L}_<(S)$. Moreover, they abstract paths in $T_{\mathcal{A}}$ in the sense that only keys $k_i$
are stored and the subpaths labeled by the values $v_i$ are implicit.


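Example 3. Consider the schema from Listing 2, without the key keywords. A
1-SEVPA $\mathcal{A}$ accepting $\mathcal{L}_<(S)$ is given in Figure 1. For clarity, call transitions¹²
and the bin state are not represented. In Figure 2, we depict its corresponding
key graph $G_{\mathcal{A}}$. Since we have the path $\langle q_0, \varepsilon \rangle \xrightarrow{\mathtt{title}\ \mathtt{s}} \langle q_2, \varepsilon \rangle$ in $T_{\mathcal{A}}$, the triplet
$(q_0, \mathtt{title}, q_2)$ is a vertex of $G_{\mathcal{A}}$. Likewise, $(q_0, \mathtt{name}, q_6)$ and $(q_7, \mathtt{year}, q_9)$ are vertices.
As we have the path
$\langle q_4, \varepsilon \rangle \xrightarrow{\prec} \langle q_0, (q_4, \prec) \rangle \xrightarrow{\mathtt{name}\ \mathtt{s}\ \#\ \mathtt{year}\ \mathtt{i}} \langle q_9, (q_4, \prec) \rangle \xrightarrow{\succ} \langle q_{10}, \varepsilon \rangle$,
$(q_3, \mathtt{conf}, q_{10})$ is also a vertex of $G_{\mathcal{A}}$. Finally, as $\langle q_2, \varepsilon \rangle \xrightarrow{\#} \langle q_3, \varepsilon \rangle$, we
have an edge from $(q_0, \mathtt{title}, q_2)$ to $(q_3, \mathtt{conf}, q_{10})$.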


Computing the key graph can be done in polynomial time by first computing
the reachability relation $\mathrm{Reach}_{\mathcal{A}}$. From this relation, the vertices can be easily
found. Since finding the edges only requires checking whether a transition reading $\#$
exists, this step can also be done in polynomial time.
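The computation just described can be sketched in Python: a fixpoint computes the well-matched reachability relation, from which the vertices and edges of Definition 3 follow. The dictionary layout, the function names, and the toy 1-SEVPA (which accepts a single document with keys `title`, `conf` and, in the inner object, `name`) are assumptions made for illustration; as in the text, the automaton is assumed to have no bin states.

```python
def call_return_targets(vpa, match, reach, q):
    """States reached from q by a word a u a-bar, with u well-matched per `reach`."""
    return {
        q2
        for (s, a, r1, gamma) in vpa["call"] if s == q
        for (r0, r) in reach if r0 == r1
        for (t, b, g, q2) in vpa["return"] if (t, b, g) == (r, match[a], gamma)
    }


def well_matched_reach(vpa, match):
    """Least fixpoint: pairs of states linked by a well-matched word."""
    reach = {(q, q) for q in vpa["states"]}
    while True:
        new = set()
        for (p, q) in reach:
            new |= {(p, q2) for (s, a, q2) in vpa["internal"] if s == q}
            new |= {(p, q2) for q2 in call_return_targets(vpa, match, reach, q)}
        if new <= reach:
            return reach
        reach |= new


def key_graph(vpa, match, keys, primitive_values, comma="#"):
    """Vertices (p, k, p') and edges of the key graph of Definition 3."""
    reach = well_matched_reach(vpa, match)
    vertices = set()
    for (p, k, r) in vpa["internal"]:
        if k not in keys:
            continue
        # After the key, read one value: a primitive value or a whole array/object.
        after = {q2 for (s, a, q2) in vpa["internal"] if s == r and a in primitive_values}
        after |= call_return_targets(vpa, match, reach, r)
        vertices |= {(p, k, p2) for p2 in after}
    edges = {
        (v1, v2) for v1 in vertices for v2 in vertices
        if (v1[2], comma, v2[0]) in vpa["internal"]
    }
    return vertices, edges


# Hypothetical 1-SEVPA accepting only  { title s # conf { name s } }.
TOY = {
    "states": set(range(9)),
    "call": {(0, "{", 0, (0, "{")), (4, "{", 0, (4, "{"))},
    "return": {(6, "}", (4, "{"), 7), (7, "}", (0, "{"), 8)},
    "internal": {(0, "title", 1), (1, "s", 2), (2, "#", 3), (3, "conf", 4),
                 (0, "name", 5), (5, "s", 6)},
}
VERTICES, EDGES = key_graph(
    TOY, {"{": "}", "[": "]"}, {"title", "conf", "name"},
    {"true", "false", "null", "s", "n", "i"},
)
print(sorted(VERTICES, key=str), EDGES)
```

Each fixpoint round either adds a pair to the relation or stops, so there are at most $|Q|^2$ rounds, each polynomial in the size of the automaton.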


Validation Algorithm In this section, we provide a streaming algorithm that
validates JSON documents against a given JSON schema $S$.

Given a word $w \in \Sigma_{\mathrm{JSON}}^* \setminus \{\varepsilon\}$, we want to check whether $w \in \mathcal{L}(S)$. The
main difficulty is that the key-value pairs inside an object are arbitrarily ordered
in $w$ while a fixed key order $<$ is encoded in the 1-SEVPA $\mathcal{A}$ ($\mathcal{L}(\mathcal{A}) = \mathcal{L}_<(S)$).


¹² Recall the form of call transitions for 1-SEVPAs, see Definition 2.

Our validation algorithm is inspired by the algorithm computing a det-VPA
equivalent to some given VPA [3] (see Theorem 1 and its proof) and uses the
key graph $G_{\mathcal{A}}$ to treat arbitrary orders of the key-value pairs inside objects.

During the reading of $w \in \Sigma_{\mathrm{JSON}}^* \setminus \{\varepsilon\}$, in addition to checking whether
$w \in \mathrm{WM}(\tilde{\Sigma}_{\mathrm{JSON}})$, the algorithm updates a subset $R \subseteq \mathrm{Reach}_{\mathcal{A}}$ and modifies the
content of a stack Stk (push, pop, modify the element on top of Stk).

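First, let us explain the information stored in $R$. Assume that we have read
the prefix $zau$ of $w$ such that $a \in \Sigma_c$ is the last unmatched call symbol (thus
$za \in (\mathrm{WM}(\tilde{\Sigma}_{\mathrm{JSON}}) \cdot \Sigma_c)^*$ and $u \in \mathrm{WM}(\tilde{\Sigma}_{\mathrm{JSON}})$).

— If $a$ is the symbol $\sqsubset$, then we have $R = \{(p, q) \mid \langle p, \varepsilon \rangle \xrightarrow{u} \langle q, \varepsilon \rangle\}$.
— If $a$ is the symbol $\prec$, then we have $u = k_1 v_1 \# k_2 v_2 \# \ldots k_{n-1} v_{n-1} \# u'$
such that $u' \in \mathrm{WM}(\tilde{\Sigma}_{\mathrm{JSON}})$ and $u'$ is a prefix of $k_n v_n$, where each $k_i v_i$ is a
key-value pair. Then $R = \{(p, q) \mid \langle p, \varepsilon \rangle \xrightarrow{u'} \langle q, \varepsilon \rangle\}$.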


In the first case, by using $R$ as defined previously, we adopt the same approach
as for the determinization of VPAs. In the second case, with $u$, we are currently
reading the key-value pairs of an object in some order, not necessarily the one
encoded in $\mathcal{A}$. In this case the set $R$ is focused on the currently read key-value
pair $k_n v_n$, that is, on the word $u'$. After reading the whole object $\prec k_1 v_1 \#
k_2 v_2 \# \ldots \succ$, we will use the key graph $G_{\mathcal{A}}$ to update the current set $R$.

Second, an element stored in the stack Stk is either a pair $(R, \sqsubset)$, or a 5-tuple
$(R, \prec, K, k, Bad)$, where $R$ is a set as described previously, $K \subseteq \Sigma_{\mathrm{key}}$ is a subset
of keys, $k \in \Sigma_{\mathrm{key}}$ is a key, and $Bad$ is a set containing some vertices of $G_{\mathcal{A}}$.¹³

We now detail our streaming validation algorithm.¹⁴ Before reading $w$, we
initialize $R$ to the set $I_{\{q_0\}}$ and Stk to the empty stack. Let us explain how to
update the current set $R$ and the current content of the stack Stk while reading
the input word $w$. Suppose that we are reading the symbol $a$ in $w$. In some cases
we will also peek at the symbol $b$ following $a$ (lookahead of one symbol).


Case (1) Suppose that $a$ is the symbol $\sqsubset$, i.e., we start an array. Hence $(R, \sqsubset)$
is pushed on Stk and $R$ is updated to $R_{upd} = I_{\{q_0\}}$. We thus proceed as in the
proof of Theorem 1 (with $I_{\{q_0\}}$ instead of $I_Q$, since $\mathcal{A}$ is a 1-SEVPA¹²).

Case (2) Suppose that $a \in \Sigma_i$ and $\sqsubset$ appears on top of Stk. We are thus reading
the elements of an array. Hence $R$ is updated to $R_{upd} = \{(p, q) \mid \exists (p, q') \in
R, (q', a, q) \in \delta_i\}$. Again we proceed as in the proof of Theorem 1.

Case (3) Suppose that $a$ is the symbol $\sqsupset$. This means that we finished reading
an array. If the stack is empty or its top element contains $\prec$, then $w \notin \mathcal{L}(S)$ and
we stop the algorithm. Otherwise $(R', \sqsubset)$ is popped from Stk and $R$ is updated
to $R_{upd} = \{(p, q) \mid \exists (p, p') \in R', (p', \sqsubset, q_0, \gamma) \in \delta_c, (q_0, r) \in R, (r, \sqsupset, \gamma, q) \in \delta_r\}$,
as in the proof of Theorem 1.

¹³ In the particular case of the object $\prec\succ$, the 5-tuple $(R, \prec, K, k, Bad)$ is replaced
by $(R, \prec)$. This situation will be clarified during the presentation of our algorithm.
¹⁴ Note that the algorithm assumes we have a 1-SEVPA.

Case (4) Suppose that $a$ is the symbol $\prec$.

— Let us first consider the particular case where the symbol $b$ following $\prec$ is
equal to $\succ$, meaning that we will read the object $\prec\succ$. In this case, $(R, \prec)$ is
pushed on Stk and $R$ is updated to $R_{upd} = I_{\{q_0\}}$ as in Case (1).

— Otherwise, if $b$ belongs to $\Sigma_{\mathrm{key}}$, we begin to read a (non-empty) object whose
treatment is different from that of an array as its key-value pairs can be read in
any order. Then, $R$ is updated to $R_{upd} = I_{P_b}$ where $P_b = \{p \in Q \mid \exists (p, b, p') \in
G_{\mathcal{A}}\}$, and $(R, \prec, K, b, Bad)$ is pushed on Stk such that $K$ is the singleton $\{b\}$ and
$Bad$ is the empty set. The 5-tuple pushed on Stk indicates that the key-value
pair that will be read next begins with key $b$; moreover $K = \{b\}$ because this
is the first pair of the object. The meaning of $Bad$ will be clarified later. The
updated set $R_{upd}$ is equal to the identity relation on $P_b$ since after reading $\prec$,
we will start reading a key-value pair whose abstracted state in $G_{\mathcal{A}}$ can be any
state from $P_b$. Later, while reading the object whose reading is started here, we
will update the 5-tuple on top of Stk as explained below.

— Finally, it remains to consider the case where $b \notin \Sigma_{\mathrm{key}} \cup \{\succ\}$. In this final
case, we have that $w \notin \mathcal{L}(S)$ and we stop the algorithm.

Case (5) Suppose that $a \in \Sigma_i \setminus \{\#\}$ and $\prec$ appears on top of Stk. Therefore,
we are currently reading a key-value pair of an object. Then $R$ is updated to
$R_{upd} = \{(p,q) \mid \exists (p,q') \in R,\ (q',a,q) \in \delta_i\}$.

Case (6) Suppose that $a$ is the symbol $\#$ and $\prec$ appears on top of Stk. This
means that we just finished reading a key-value pair whose key $k$ is stored in the
5-tuple $(R', \prec, K, k, Bad)$ on top of Stk, and that another key-value pair will be
read after symbol $\#$. The set $K$ in $(R', \prec, K, k, Bad)$ stores all the keys of the
key-value pairs already read, including $k$.


– If the symbol $b$ following $\#$ does not belong to $\Sigma_{key}$, then $w \notin L(S)$ and we
stop the algorithm.

– Otherwise, if $b$ belongs to $K$, this means that the object contains the same
key twice, that is, $w \notin L(S)$, and we also stop the algorithm.

– Otherwise, the set $R$ is updated to $R_{upd} = Id_{P_b}$ (as we begin the reading of
a new key-value pair whose key is $b$) and the 5-tuple $(R', \prec, K, k, Bad)$ on top
of Stk is updated such that (i) $K$ is replaced by $K \cup \{b\}$, (ii) $k$ is replaced by
$b$, and (iii) all vertices $(p, k, p')$ of $G_A$ such that $(p, p') \notin R$ are added to the set
$Bad$. Recall that the vertex $(p, k, p')$ of $G_A$ is a witness of a path $(p, \varepsilon) \xrightarrow{kv} (p', \varepsilon)$
in $T_A$ for some key-value pair $kv$. Hence by adding this vertex $(p, k, p')$ to $Bad$,
we mean that the pair that has just been read does not use such a path.


Case (7) Suppose that $a$ is the symbol $\succ$. Therefore we end the reading of
an object. If the stack is empty or its top element contains $\sqsubset$, then $w \notin L(S)$
and we stop the algorithm. Otherwise the top of Stk contains either $(R', \prec)$ or
$(R', \prec, K, k, Bad)$, which we pop from Stk.


– If $(R', \prec)$ is popped, then we are ending the reading of the object $\prec\succ$.
Hence, we proceed as in Case (3): $R$ is updated to
$R_{upd} = \{(p,q) \mid \exists (p,p') \in R',\ (p', \prec, q_0, \gamma) \in \delta_c,\ (q_0, \succ, \gamma, q) \in \delta_r\}$.
Notice that $R$ does not appear in $R_{upd}$ as $R = Id_{\{q_0\}}$.

– If $(R', \prec, K, k, Bad)$ is popped, we are ending an object whose last seen key
is $k$. As in Case (6), we add to $Bad$ all vertices $(p, k, p')$ such that $(p, p') \notin R$.
Let $Valid(K, Bad)$ be the set of pairs of states $(q_0, r)$ such that there exists a
path $((p_1, k_1, p'_1) \cdots (p_n, k_n, p'_n))$ in $G_A$ with $p_1 = q_0$, $p'_n = r$, $(p_i, k_i, p'_i) \notin Bad$
for all $i \in \{1, \ldots, n\}$, and $K = \{k_1, \ldots, k_n\}$. Then $R$ is updated to
$R_{upd} = \{(p,q) \mid \exists (p,p') \in R',\ (p', \prec, q_0, \gamma) \in \delta_c,\ (q_0, r) \in Valid(K, Bad),\ (r, \succ, \gamma, q) \in \delta_r\}$.
We thus proceed as in Case (3) except that the condition $(q_0, r) \in R$ is replaced by
$(q_0, r) \in Valid(K, Bad)$. That way, we check that the key-value pairs that have
been read as composing an object of $w$ label some path in $T_A$, once ordered by
$<$. That is, the corresponding abstract path appears in $G_A$.
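The computation of Valid(K, Bad) can be sketched as a depth-first search over the key graph. The sketch below is only illustrative: the adjacency-dictionary encoding of the key graph, the function name `valid`, and the toy graph are assumptions, and the search relies on the property (Lemma 1) that no key repeats along a path of a correctly learned automaton.

```python
def valid(K, bad, vertices, edges, q0):
    """Pairs (q0, r) such that some key-graph path starts in a vertex
    (q0, k, p'), avoids the vertices in `bad`, uses exactly the keys of K
    (each once), and ends in a vertex whose last component is r."""
    K = frozenset(K)
    result = set()

    def explore(vertex, seen):
        _, key, target = vertex
        seen = seen | {key}
        if seen == K:
            # All keys of the object have been matched along this path.
            result.add((q0, target))
            return
        for succ in edges.get(vertex, ()):
            if succ not in bad and succ[1] in K and succ[1] not in seen:
                explore(succ, seen)

    for v in vertices:
        if v[0] == q0 and v not in bad and v[1] in K:
            explore(v, frozenset())
    return result

# Hypothetical key graph: vertices are triples (p, key, p').
V1 = ("q0", "a", "p1")
V2 = ("p2", "b", "p3")
V3 = ("q0", "b", "p4")
VERTICES = [V1, V2, V3]
EDGES = {V1: [V2]}
```

On this toy graph, an object with keys `a` and `b` is matched by the path `V1 V2`, unless one of these vertices was added to `Bad` while reading the object.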


Case (8) Suppose that $a \in \Sigma_i$ and Stk is empty. Then $w \notin L(S)$ and we stop
the algorithm. Indeed, an internal symbol appears either in an array or in an
object (see Cases (2), (5), and (6) above).


Finally, when the input word $w$ is completely read, we check whether the
stack Stk is empty and the computed set $R$ contains a pair $(q_0, q)$ with $q \in Q_F$.
The complexity of our algorithm is given in the following proposition.


Proposition 1. Let $S$ be a schema and $A$ be a 1-SEVPA such that $L(A) = L_{<}(S)$.
Deciding if a document $J$ is valid is in time
$O(|J| \cdot (|Q|^4 + |Q|^{|\Sigma_{key}|} \cdot |\Sigma_{key}|^{|\Sigma_{key}|+1}))$,
and uses $O(|\delta| + |Q|^2 \cdot |\Sigma_{key}| + depth(J) \cdot (|Q|^2 + |\Sigma_{key}|))$ memory.


5 Implementation and Experiments 


We present here our Java implementation of the learning process and the val- 
idation algorithm. First, we present classical validation algorithms and explain 
how to generate documents from a schema. We then explain how the required 
membership and equivalence queries are implemented. Finally, we present the 
schemas we evaluated, and the results for the learning, computation of the key 
graph, and validation experiments. The reader is referred to the code documen- 
tation for more details about our implementation [27-31]. 

In the remainder of this section, let us assume we have a JSON schema $S_0$.


Classical Validation Algorithm and Documents Generation Let us explain
briefly the classical algorithm used in many implementations for validating
a JSON document $J_0$ against $S_0$ [13]. It is a recursive algorithm that follows the
constraints of $S_0$.¹⁶ For instance, if the current value $J$ is an object, we iterate
over each key-value pair in $J$ and its corresponding sub-schema in the current
schema $S$. Then, $J$ satisfies $S$ if and only if the values in the key-value pairs
all satisfy their corresponding sub-schema. As long as $S_0$ does not contain any
Boolean operations, this algorithm is straightforward and linear in the size of
both the initial document $J_0$ and schema $S_0$. However, if $S_0$ contains Boolean
operations, then the current value $J$ may be processed multiple times.


¹⁶ Such a recursive algorithm is briefly presented in [19].
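A recursive validator of this kind can be sketched as follows. This is a simplified illustration and not the validator described in this paper: it handles only a handful of hypothetical keywords (`type`, `properties` with all keys mandatory, `items`, and `anyOf` as the sole Boolean operation) and works on plain Python values rather than on abstracted documents.

```python
def validate(doc, schema):
    """Recursively check a Python value against a simplified schema."""
    if "anyOf" in schema:
        # Boolean operation: the same value may be examined several times.
        return any(validate(doc, sub) for sub in schema["anyOf"])
    kind = schema.get("type")
    if kind == "object":
        props = schema.get("properties", {})
        return (isinstance(doc, dict)
                and set(doc) == set(props)  # every key is mandatory
                and all(validate(doc[k], props[k]) for k in props))
    if kind == "array":
        return (isinstance(doc, list)
                and all(validate(x, schema["items"]) for x in doc))
    if kind == "string":
        return isinstance(doc, str)
    if kind == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(doc, int) and not isinstance(doc, bool)
    if kind == "boolean":
        return isinstance(doc, bool)
    return False

EXAMPLE_SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"anyOf": [{"type": "integer"}, {"type": "string"}]},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}
```

Since the object case compares key sets, the order in which the key-value pairs appear in the document plays no role here, in contrast with the streaming setting.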
In order to match the abstractions we defined (see Section 3) and to have
options to tune the learning process, we implemented our own classical validator.
Alongside the validator, we implemented a tool to generate JSON documents
whose structure is dictated by $S_0$. Due to the Boolean operations $S_0$ can
contain, it may happen that choices must be made during the generation process.
We have two generators: a random generator that makes a choice at random, and
an exhaustive generator that exhaustively explores every choice, thus producing
every valid document one by one. Moreover, we implemented modifications of
these generators to allow the creation of invalid documents, by allowing
deviations.¹⁷ For instance, if the current schema describes an integer, we can instead
decide to generate a string. To ensure we eventually produce a document, we
can fix a maximal depth (i.e., the maximal number of nested objects or arrays).
This is useful for recursive schemas, or when generating invalid documents.


Learning Algorithm Let us now focus on the learning algorithm itself, and
in particular on the membership and equivalence queries. We recall that the
equivalence queries are performed by generating a certain number of (valid and
invalid) JSON documents and by verifying that the learned VPA $H$ and the given
schema $S_0$ agree on the documents' validity. As said in Section 2, we use the
TTT algorithm [9] to learn a 1-SEVPA from $S_0$, relying on its implementation
in the well-known Java libraries LEARNLIB and AUTOMATALIB [11].

We use the random and exhaustive generators of valid and invalid documents
as explained above and we fix two constants $C$ and $D$ depending on the schema
to be learned.¹⁸ For a membership query over a word $w \in \Sigma_{JSON}^*$, the teacher
runs the classical validator on $w$ and $S_0$. For an equivalence query over a learned
1-SEVPA $H$, the teacher uses a generator to produce documents on which $H$ is
tested. If that generator is random, at each query, $C$ documents are generated
for each document depth between 0 and $D$. If none of the documents leads to a
counterexample, the teacher checks whether $G_H$ violates Lemma 1, i.e.,
whether there is a path $((p_1, k_1, p'_1) \cdots (p_n, k_n, p'_n))$ with $p_1 = q_0$ such that $k_i = k_j$
for some $i \neq j$. In that case, we can create a counterexample.


Evaluated Schemas For the experimental evaluation of our algorithms, we 
consider the following schemas, sorted in increasing size: (1) A schema that 
accepts documents defined recursively. Each object contains a string and can 
contain an array whose single element satisfies the whole schema, i.e., this is 
a recursive list. (2) A schema that accepts documents containing each type of 
values, i.e., an object, an array, a string, a number, an integer, and a Boolean. 
(3) A schema that defines how snippets must be described in Visual Studio 
Code [23]. (4) A recursive schema that defines how the metadata files for VIM 
plugins must be written [22]. (5) A schema that defines what Azure Functions
Proxies files must look like [20]. (6) A schema that defines the configuration file


¹⁷ This is similar to mutation testing [1,12].
¹⁸ The values of C and D are given below.
for a code coverage tool called codecov [21]. Hence, we consider two schemas
written by ourselves to test our framework, and four schemas that are used in 
real world cases. The last four schemas were modified to make all object keys 
mandatory and to remove unsupported keywords. All used schemas and scripts 
can be consulted on our repository [30]. In the rest of this section, the schemas 
are referred to by their order in the previous enumeration. 

We present three types of experimental results: (1) the time and number 
of membership and equivalence queries to learn a 1-SEVPA A from a JSON 
schema, (2) the time and memory to compute the reachability relation Reach, 
and the key graph G4, and (3) the time and memory to validate a document 
using both classical and new algorithms. The server used for the benchmarks ran
OpenJDK version 11.0.12 on Debian 10 over Linux 5.4.73-1-pve with a 4-core
Intel® Xeon® Silver 4214R Processor with 16.5M cache, and 64GB of RAM.


Learning VPAs First, we learn a 1-SEVPA from a schema. We use an exhaus- 
tive generator for the first three schemas (accepting a small set of documents),
and a random generator¹⁹ for the remaining three, for which we fix $C = 10000$.
For both generators, we set $D = depth(S) + 1$, where $depth(S)$ is the maximal
number of nested objects and arrays in the schema S, except for the recursive 
list where D = 10, and for the recursive VIM plugin schema where D = 7. 

For the first five schemas, we do not set a time limit and repeat the learning 
process ten times. For the last schema, we set a time limit of one week and,
because of time constraints, only perform the learning process once. After that
limit, we stop the computation and retrieve the learned 1-SEVPA at that point. The
retrieved automaton is therefore an approximation of this schema. Its key graph
has repeated keys along some of its paths, a situation that cannot occur if the
1-SEVPA were correctly learned (see Lemma 1). Results are given in Table 1.


Comparing Validation Algorithms The second part of the preprocessing 
step is to construct the key graph of the learned 1-SEVPA. For each evaluated 
schema, we select the learned automaton with the largest set of states, in order 
to report a worst-case measure. Results after a single experiment are given in 
Table 2. We can see that the storage of the key graph does not consume more
than one megabyte, except for the codecov schema. That is, even for non-trivial
schemas, the key graph is relatively lightweight.

Finally, we compare both classical and new streaming validation algorithms. 
For the latter, we use the 1-SEVPA (and its key graph) selected as described 
above. We first generate 5000 valid and 5000 invalid JSON documents using a 
random generator, with a maximal depth equal to D = 20. We then measure the 
time and memory required by both validation algorithms on these documents.²⁰


¹⁹ With the random generator, the learned 1-SEVPAs may differ across experiments.

²⁰ Since obtaining a close approximation of the consumed memory requires Java to
stop the execution and destroy all unused objects, we execute each algorithm twice:
once to measure time, and a second time to measure memory.


Validating Streaming JSON Documents with Learned VPAs 


Time (s)    Membership  Equivalence    |Q|   |Σ|   |δ_c|    |δ_r|   |δ_i|  Diameter
     2.2        2055.0          5.0    7.0  15.0    14.0      3.0     5.0       3.0
     4.5       69514.0          3.0   24.0  20.0    48.0      3.0    26.0      12.0
     9.0       21943.0          5.0   16.0  17.0    32.0      7.0    18.0      13.0
  9590.3     4246085.0         36.4  150.0  27.0   300.0   2946.5   760.3       9.0
 35008.2     4063971.7         30.5  121.0  35.0   242.0   2123.0   752.5      13.3
 Timeout   633049534.0        192.0  884.0  77.0  1768.0  89695.0  8557.0      28.0

Table 1: Learning results. For the first five schemas, values are averaged out of
ten experiments. For the last schema, a single experiment was conducted.


|            Reach_A              |                        G_A                         |
Time (s)  Memory (kB)    Size      Time (s)  Computation (kB)  Storage (kB)  Size
      34          492      31           100              2231            65     3
      67         1152     213           234              2623            69     9
      67          737     125           118              2223            69    10
    1756        10316    5832          1715             11827           419   418
    2208        13978    4420          2839             17968           667   541
  377141       212970  270886        187659            120398         16335  6397


Table 2: Results for the computation of $Reach_A$ and $G_A$. The Computation (resp.
Storage) column gives the memory required to compute $G_A$ (resp. to store $G_A$).
Fig. 3: Results of validation benchmarks.

On all considered documents, both algorithms return the same classification
output, even for the partially learned 1-SEVPA.

For our algorithm, we only measure the memory required to execute the 
algorithm, as we do not need to store the whole document to be able to process 
it. We also do not count the memory to store the 1-SEVPA and its key graph. As 
the classical algorithm must have the complete document stored in memory, we 
sum the RAM consumption for the document and for the algorithm itself. This
is consistent with what happens in actual web-service handling: whenever a new
validation request is received, we would spawn a new subprocess that handles 
a specific document. Since the 1-SEVPA and its key graph are the same for all 
subprocesses, they would be loaded in a memory space shared by all processes. 

Experimental results indicate that our algorithm exhibits good performance. 
Results for the three smaller schemas are not presented here to save space,
while they are given in Figure 3 for VIM plugins, Azure Functions Proxies, and
codecov. The blue (resp. red) crosses (resp. circles) give the values for our (resp. 
the classical) algorithm. The x-axis gives the size of each (abstracted) document. 

For both VIM plugins and Azure Functions Proxies, our algorithm consumes 
less memory than the classical one. For these benchmarks, memory and time 
usage seemingly trade off as we see that our algorithm usually requires more 
time to validate a document — a majority of that time is spent computing the 
set Valid(K, Bad). This tradeoff, however, does not hold in general: There are 
schemas for which our algorithm performs better than the classical one, both 
in terms of time and memory, as it does not have to backtrack to validate a 
document, which reduces the time and memory space required. 

For the codecov schema, we recall that the learning process was not com- 
pleted, leading to an approximated 1-SEVPA with repeated keys in its key graph. 
This means that the computation of Valid(K, Bad) explores some invalid paths, 
increasing the memory and time consumed by our algorithm. Thus, it appears 
that, while a not completely learned 1-SEVPA can still be used in our algorithm, 
stopping the learning process early may increase the time and space required. 


6 Future Work 


As future work, one could focus on constructing the VPA directly from the 
schema, without going through a learning algorithm. While this task is easy 
if the schema does not contain Boolean operations, it is not yet clear how to 
proceed in the general case. Second, it could be worthwhile to compare our 
algorithm against an implementation of a classical algorithm used in the industry. 
This would require either to modify the industrial implementations to support 
abstractions, or to modify our algorithm to work on unabstracted JSON schemas. 
Third, in our validation approach, we decided to use a VPA accepting the JSON 
documents satisfying a fixed key order — thus requiring to use the key graph and 
its costly computation of the set Valid(K, Bad). It could be interesting to make 
additional experiments to compare this approach with one where we instead use 
a VPA accepting the JSON documents and all their key permutations — in this 
case, reasoning on the key graph would no longer be needed. Finally, motivated
by obtaining efficient querying algorithms on XML trees, the authors of [26] have 
introduced the concept of mixed automata in a way to accept subsets of unranked 
trees where some nodes have ordered sons and some other have unordered sons. 
It would be interesting to adapt our validation algorithm to different formalisms 
of documents, such as the one of mixed automata. 


Data-Availability Statement. The source code and experimental results that 
support the findings of this study are available in Zenodo with the identifier 
https://doi.org/10.5281/zenodo.7309690 [31].
Abstract. We define novel algorithms for the inclusion problem between
two visibly pushdown languages of infinite words, an EXPTIME-complete
problem. Our algorithms search for counterexamples to inclusion in the
form of ultimately periodic words, i.e., words of the form $uv^\omega$ where $u$ and
$v$ are finite words. They are parameterized by a pair of quasiorders telling
which ultimately periodic words need not be tested as counterexamples
to inclusion without compromising completeness. The pair of quasiorders
enables distinct reasoning for prefixes and periods of ultimately periodic
words, thereby allowing us to discard even more words compared to using the
same quasiorder for both. We put forward two families of quasiorders: the
state-based quasiorders based on automata and the syntactic quasiorders
based on languages. We also implemented our algorithm and conducted
an empirical evaluation on benchmarks from software verification.


1 Introduction 


Visibly pushdown languages [4] (VPL) have applications in various domains
including verification [22], theorem proving [27] or XML schema languages
reasoning [26], where the inclusion problem plays a crucial role. For instance, proving
correctness relative to a specification reduces to a language inclusion problem,
and so does proving correctness of a theorem of the form $\forall x \exists y\, P(x) \Rightarrow Q(y)$.
The extension to the case of visibly pushdown languages of infinite words
($\omega$-VPL) has also been studied in the context of program verification [21] and it
has applications in word combinatorics [23,25,27].

We distinguish two general approaches to solve the language inclusion
problem $L \subseteq M$: (i) complement $M$, intersect with $L$ and check for emptiness of the
result; and (ii) reduce the inclusion check to finitely many membership queries
asking whether $w \in M$ holds where $w \in L$ and each query aims at finding a
counterexample to inclusion.


* This work was partially funded by the ESF Investing in your future, the RYC-2016-
20281/MCIN/AEI/10.13039/501100011033, the Madrid regional government as part
of the program S2018/TCS-4339 (BLOQUES-CM) co-funded by EIE Funds of the
European Union, the PRODIGY Project (TED2021-132464B-100) funded by MCIN
and the European Union NextGenerationEU/PRTR.


© The Author(s) 2023 
S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 290-307, 2023. 
https://doi.org/10.1007/978-3-031-30823-9_15
In this paper we focus on the second approach. Previous work in that space
leverages relations between words to select a finite subset of words of $L$ on which
we run the membership queries. A class of relations that consistently yields
good results in practice are quasiorders, which discard words subsumed (for the
quasiorder) by others. A key feature of such quasiorders is that the subset of $L$
selected via the quasiorder must contain a counterexample to inclusion if there
exists one. Quasiorders are a versatile heuristic that has been applied to inclusion
problems for languages such as languages of finite words [3,10,14] (including
visibly pushdown languages [6]) or infinite words [1,2,12,13,16,24] and even tree
languages [3,5]. Algorithms leveraging quasiorders are commonly referred to as
antichains algorithms. Subsequent improvements (e.g. [2] improving [1]) often
attempt to define coarser quasiorders because they enable the selection of an
even smaller subset of $L$.

Let us now turn to the inclusion problem between $\omega$-VPL, an EXPTIME-
complete problem. For that problem the selection of words of $L$ is limited to
ultimately periodic words, i.e. words of the form $uv^\omega$, where $u$ and $v$ are called
prefix and period respectively. For an ultimately periodic word $uv^\omega$, subsumption
(for a quasiorder) simply means subsumption of $(u, v)$ relative to a pair $\leq_1 \times \leq_2$
of quasiorders on finite words. The quasiorders found in the literature [17,18] are
all equivalences and are all such that $\leq_1 = \leq_2$.

In this paper, we propose a new family of algorithms for the inclusion
problem between $\omega$-VPL that leverages a subset of the ultimately periodic words,
deemed legitimate decompositions, and is parameterized by a pair of quasiorders
and a decision procedure for the membership queries in $M$. We identify
properties that such a pair of quasiorders must satisfy so that the resulting algorithm
actually decides the inclusion problem between two $\omega$-VPL: (1) be decidable;
(2) be well-quasiorders; (3) verify some monotonicity conditions w.r.t. word
operations that are characteristic of $\omega$-VPL; and (4) satisfy a preservation property
intuitively saying that a legitimate decomposition inside $M$ cannot subsume a
legitimate decomposition outside of $M$. We put forward two families of quasiorders
satisfying (1) through (4): the state-based quasiorders, whose definition relies on a
visibly pushdown automaton underlying $M$, and the syntactic quasiorders, whose
definition is based solely on $M$. The syntactic orders are the "ideal" quasiorders
in the sense that they are the coarsest, hence they select the "smallest" subset of $L$.
None of our quasiorders is symmetric, hence they are coarser than equivalences,
and in each and every pair we define, the quasiorder on prefixes differs from the
one on periods (i.e. $\leq_1 \neq \leq_2$). We further prove that when instantiated with
the state-based quasiorders and with a state-based decision procedure for
membership queries, the resulting algorithm, which we call the state-based algorithm,
has a runtime that matches the corresponding problem complexity.

Finally we implement the state-based algorithm and evaluate it on various
benchmarks collected from Friedmann et al. [18] and from SV-COMP³, the
Software Verification competition. The empirical evaluation is carried out against
Ultimate [21] which follows a complement, intersect and check for emptiness


³ https://sv-comp.sosy-lab.org
approach. The preliminary conclusion of the empirical results is in favor of our
approach as it scales up better. 


Related Work. Bruyère et al. [6] proposed an antichain algorithm for the
inclusion of VPL but they only tackle the problem for languages of finite words.
The same limitation applies to Ganty et al. [19,20] where, moreover, they do
not tackle the inclusion problem of VPL into VPL (the closest they tackle is
CFL into regular). The extension from the finite to the infinite case was tackled
in Doveri et al. [13] but they do not cover the case $\omega$-VPL into $\omega$-VPL (the
closest they tackle is $\omega$-CFL into $\omega$-regular). Friedmann et al. [17,18] do tackle
the $\omega$-VPL into $\omega$-VPL problem. However, they do not leverage the full power
of quasiorders (they use equivalences instead); they do not use distinct pruning
techniques for prefixes and periods; and they do not put forward syntactic
quasiorders. A summary comparing our work (omegaVPLinc) with the closest works
in the area is given in Table 1.


Table 1. Comparison of the closest work in the area based on the characteristics of the
problem tackled (first two columns) and the techniques used (last three columns). N/A
means not applicable, O means no support and @ means full support. The labels $\omega$,
VPL, qo, $\leq_1 \neq \leq_2$ and syntactic qo ask respectively whether the work thereof tackles
the problem of infinite words, tackles the problem of VPL, leverages quasiorders, defines
distinct quasiorders for prefixes and periods, and defines syntactic quasiorders.


ω | VPL | qo | ≤_1 ≠ ≤_2 | syntactic qo
Bruyere et al. [6] O® J@n O 
Ganty et al. [20] OO ||@ |N/A (J 
Doveri et al. [13] ©@O ee O 
Friedmann et al. [18]|@|@ OJO O 
omegaVPLinc ee 09 [7 


2 Background 


Fix $\Sigma \triangleq \Sigma_i \cup \Sigma_c \cup \Sigma_r$ an alphabet (a finite non-empty set of symbols) comprising
three disjoint alphabets. The set of finite words and the set of infinite words over
$\Sigma$ are denoted by $\Sigma^*$ and $\Sigma^\omega$ respectively. We denote by $\epsilon$ the empty word and
define $\Sigma^+ \triangleq \Sigma^* \setminus \{\epsilon\}$. Given a word $u = u_0 u_1 \cdots \in \Sigma^* \cup \Sigma^\omega$ we say that a position
$j$, where $j \in \mathbb{N}$, $j < |u|$ and $|u| \in \mathbb{N} \cup \{\omega\}$ is the length of $u$, is an internal (resp.
call, resp. return) position if $u_j \in \Sigma_i$ (resp. $u_j \in \Sigma_c$, resp. $u_j \in \Sigma_r$).


Visibly Pushdown Languages. A Visibly Pushdown Automaton (VPA) over
$\Sigma$ is a tuple $A = (Q, q_I, \Gamma, \delta, F)$, where $Q$ is a finite set of states including an
initial state $q_I \in Q$, $F \subseteq Q$ is the set of final states, $\Gamma$ is the stack alphabet
including a bottom-of-stack symbol $\bot$ and $\delta = \delta_i \cup \delta_c \cup \delta_r$ consists of three
transition relations $\delta_i \subseteq Q \times \Sigma_i \times Q$, $\delta_c \subseteq Q \times \Sigma_c \times Q \times \Gamma \setminus \{\bot\}$ and
$\delta_r \subseteq Q \times \Sigma_r \times \Gamma \times Q$. Configurations in $A$ are pairs in $Q \times \Gamma^*$. For $a \in \Sigma$ we define
the relation $\vdash^a$ between configurations as follows:
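– If $a \in \Sigma_i$ and $w \in \Gamma^*$ we have $(p, w) \vdash^a (q, w)$ if $(p, a, q) \in \delta_i$.

– If $a \in \Sigma_c$ and $w \in \Gamma^*$ we have $(p, w) \vdash^a (q, w\gamma)$ if $(p, a, q, \gamma) \in \delta_c$.

– If $a \in \Sigma_r$, $\gamma \in \Gamma \setminus \{\bot\}$ and $w \in \Gamma^*$ we have $(p, w\gamma) \vdash^a (q, w)$ if $(p, a, \gamma, q) \in \delta_r$.
– If $a \in \Sigma_r$ we have $(p, \bot) \vdash^a (q, \bot)$ if $(p, a, \bot, q) \in \delta_r$.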
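The four rules above can be turned into a successor function on configurations. The following Python sketch is an illustration under assumptions: the dictionary encoding of the automaton, the string `"_|_"` standing for the bottom-of-stack symbol, and the one-state toy automaton are hypothetical and are not the automata of Fig. 1.

```python
BOTTOM = "_|_"  # stands for the bottom-of-stack symbol

def step(config, a, vpa):
    """All configurations reachable from `config` by reading symbol `a`.
    A configuration is (state, stack); the stack is a tuple whose first
    element is BOTTOM and whose last element is the top."""
    state, stack = config
    successors = set()
    if a in vpa["internal"]:
        for (p, b, q) in vpa["delta_i"]:
            if p == state and b == a:
                successors.add((q, stack))
    elif a in vpa["call"]:
        for (p, b, q, gamma) in vpa["delta_c"]:
            if p == state and b == a:
                successors.add((q, stack + (gamma,)))
    elif a in vpa["return"]:
        top = stack[-1]
        for (p, b, gamma, q) in vpa["delta_r"]:
            if p == state and b == a and gamma == top:
                # Pop the top symbol; the bottom symbol is never removed.
                rest = stack if top == BOTTOM else stack[:-1]
                successors.add((q, rest))
    return successors

def run(word, vpa, start):
    """Configurations reachable from `start` after reading a finite word."""
    configs = {start}
    for a in word:
        configs = {c2 for c in configs for c2 in step(c, a, vpa)}
    return configs

# Hypothetical one-state automaton over calls {c} and returns {r}.
TOY = {
    "internal": set(), "call": {"c"}, "return": {"r"},
    "delta_i": set(),
    "delta_c": {("p", "c", "p", "A")},
    "delta_r": {("p", "r", "A", "p"), ("p", "r", BOTTOM, "p")},
}
```

Because the input symbol alone decides whether the stack grows, shrinks or stays the same, the stack height after a finite word depends only on the word, which is the defining feature of visibly pushdown automata.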


We lift the relation $\vdash$ to words by transitivity and reflexivity, that is, for all
$u \in \Sigma^*$, $(q, w) \vdash^{*u} (p, w')$ when the configurations $(q, w)$ and $(p, w')$ are related
by a sequence of transitions such that the concatenation of the corresponding
labels is the word $u$. We write $(q, w) \vdash^{\circledast u} (p, w')$ when such a sequence includes a
configuration whose state is final. A trace of $A$ on an infinite word $\xi = a_0 a_1 \cdots \in \Sigma^\omega$
is an infinite sequence $(q_0, w_0) \vdash^{a_0} (q_1, w_1) \vdash^{a_1} \cdots$. It is a final trace when
$q_j \in F$ for infinitely many $j$'s. It is an accepting trace when it is a final trace
and $(q_0, w_0) = (q_I, \bot)$. The $\omega$-language accepted by $A$ is $L^\omega(A) \triangleq \{\xi \in \Sigma^\omega \mid$
there is an accepting trace of $A$ on $\xi\}$. A language $L \subseteq \Sigma^\omega$ is $\omega$-VPL if $L =
L^\omega(A)$ for some VPA $A$. Two examples of VPA are given in Fig. 1: $A$ has an
accepting trace on ccr crcr... and so does $B$ on crrcrr...


[Figure: the two automata (A) and (B), with transitions labelled c/A and r/A.]


Fig. 1. Two $\omega$-VPA with $\Gamma = \{A, \bot\}$, $\Sigma_i = \emptyset$, $\Sigma_c = \{c\}$ and $\Sigma_r = \{r\}$.


Ultimately Periodic Words. An ultimately periodic word is an infinite word
$\xi \in \Sigma^\omega$ such that $\xi = uv^\omega$ for some finite prefix $u \in \Sigma^*$ and some finite period
$v \in \Sigma^+$. We call the couple $(u, v) \in \Sigma^* \times \Sigma^+$ a decomposition of $\xi$. Note that $\xi$
admits infinitely many decompositions.

Ultimately periodic words play a central role in our approach as they suffice
for the inclusion problem as shown by the following theorem.⁴


Theorem 1. Let $L, M \subseteq \Sigma^\omega$ be $\omega$-VPL. Then, $L \subseteq M$ iff $\forall uv^\omega \in L,\ uv^\omega \in M$.
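To illustrate why decompositions are not unique, the following Python sketch enumerates a finite prefix of $uv^\omega$ from a decomposition $(u, v)$. The function name `prefix` is an assumption; note also that agreement on a finite prefix is only a sanity check, not a proof that two decompositions denote the same infinite word.

```python
from itertools import chain, cycle, islice

def prefix(u, v, n):
    """First n symbols of the infinite word u v^omega (v must be non-empty)."""
    if not v:
        raise ValueError("the period must be non-empty")
    return "".join(islice(chain(u, cycle(v)), n))
```

For instance, the decompositions `("c", "cr")`, `("cc", "rc")` and `("c", "crcr")` all start with the same symbols: shifting one symbol from the period into the prefix, or repeating the period, does not change the word.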


Matching Relation. The partition of the alphabet $\Sigma = \Sigma_i \cup \Sigma_c \cup \Sigma_r$ induces
a unique matching relation between a word's call and return positions (see [18]).
Given $u \in \Sigma^* \cup \Sigma^\omega$ define the matching relation of $u$, denoted $\curvearrowright_u$, as the unique
relation on its call and return positions such that for every $j \curvearrowright_u k$ we have
$0 \leq j < k < |u|$, $u_j \in \Sigma_c$, $u_k \in \Sigma_r$, $|\{n \mid j \curvearrowright_u n\}| \leq 1$, $|\{n \mid n \curvearrowright_u k\}| \leq 1$ and
there are no $j', k'$ with $j' \curvearrowright_u k'$ and $j < j' < k < k'$. Given $j \curvearrowright_u k$ we say that
$j$ and $k$ are matched positions. A call (resp. return) position $j$ in $u$ is unmatched

4 Theorem 1 can be easily obtained by adapting the proof of Fact 1 in [7]. 
if $j \curvearrowright_u k$ (resp. $k \curvearrowright_u j$) for no $k$. Furthermore, for every unmatched position $n$
in $u$ there is no $j \curvearrowright_u k$ such that $j < n < k$, and if $u_n \in \Sigma_c$ (resp. $u_n \in \Sigma_r$) then
there is no unmatched return (resp. call) position $k$ with $n < k$ (resp. $k < n$). A
word is said to be well-matched if it has no unmatched position.
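For a finite word, the matching relation can be computed with a stack of pending call positions, as in the following Python sketch. The function name and the default alphabets (calls `c`, returns `r`) are assumptions chosen to match the running examples.

```python
def matching(u, calls=frozenset("c"), returns=frozenset("r")):
    """Return (pairs, unmatched_calls, unmatched_returns) for a finite word u.
    `pairs` lists the matched positions (j, k) with j a call and k a return."""
    pending = []            # positions of calls not matched so far
    pairs = []
    unmatched_returns = []
    for k, a in enumerate(u):
        if a in calls:
            pending.append(k)
        elif a in returns:
            if pending:
                # A return matches the most recent pending call.
                pairs.append((pending.pop(), k))
            else:
                unmatched_returns.append(k)
        # Internal symbols do not take part in the matching.
    return pairs, pending, unmatched_returns
```

A return is left unmatched only when no call is pending, so every unmatched return occurs before every unmatched call, in line with the property stated above.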


3 Foundations 


In this section we outline our approach which, given a VPA $A = (Q, q_I, \Gamma, \delta, F)$
and an $\omega$-VPL $M$, reduces the inclusion problem $L^\omega(A) \subseteq M$ to finitely many
membership queries in $M$. More precisely, we derive a finite subset $S_{finite}$ of
ultimately periodic words of $L^\omega(A)$ such that


$L^\omega(A) \subseteq M \iff \forall (u,v) \in S_{finite},\ uv^\omega \in M$ . (†)


Reduction to Legitimate Decompositions. Our first step is to reduce the
inclusion check to a subset of ultimately periodic words of $L^\omega(A)$ given by legitimate
decompositions. To do so, we define $W$ as the set of well-matched finite words,
$C$ (resp. $R$) as the set of finite words where all call (resp. return) positions are
matched and $U_c$ as the set of finite words with at least one unmatched call
position. In turn, we define the set of legitimate decompositions given by


$Ld \triangleq C \times C \ \cup\ U_c \times R$
which, as shown next, is sufficient for the inclusion problem between $\omega$-VPL.
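Membership in the sets $W$, $C$, $R$, $U_c$ and $Ld$ only depends on how many calls and returns are left unmatched, which the following Python sketch makes explicit. The function names and the default alphabets (calls `c`, returns `r`) are assumptions; the extra non-emptiness test on the period reflects that $uv^\omega$ is only defined for a non-empty $v$.

```python
def unmatched_counts(u, calls=frozenset("c"), returns=frozenset("r")):
    """Number of unmatched calls and of unmatched returns in a finite word."""
    pending_calls = 0
    unmatched_returns = 0
    for a in u:
        if a in calls:
            pending_calls += 1
        elif a in returns:
            if pending_calls:
                pending_calls -= 1
            else:
                unmatched_returns += 1
    return pending_calls, unmatched_returns

def in_C(u):   # all call positions are matched
    return unmatched_counts(u)[0] == 0

def in_R(u):   # all return positions are matched
    return unmatched_counts(u)[1] == 0

def in_W(u):   # well-matched
    return in_C(u) and in_R(u)

def in_Uc(u):  # at least one unmatched call
    return not in_C(u)

def is_legitimate(u, v):
    """(u, v) is in Ld = C x C  union  U_c x R, with a non-empty period."""
    if not v:
        return False
    return (in_C(u) and in_C(v)) or (in_Uc(u) and in_R(v))
```

For example, `("c", "cr")` is legitimate because the prefix has a pending call and the period has no unmatched return, whereas `("c", "r")` is not: the period would keep popping the stack built by the prefix.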


Theorem 2. Let $L, M \subseteq \Sigma^\omega$ be $\omega$-VPL. Then, $L \subseteq M$ iff $\forall (u,v) \in Ld$,
$uv^\omega \in L \Rightarrow uv^\omega \in M$.


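Next we leverage the relations $\vdash^{*}$ and $\vdash^{\circledast}$ of $A$ to characterize the legitimate
decompositions of the ultimately periodic words of $L^\omega(A)$. We start by defining
the following languages of finite words for each pair $p, q \in Q$ of states of $A$:
$L_{p,q} \triangleq \{u \in \Sigma^* \mid \exists w \in \Gamma^*,\ (p, \bot) \vdash^{*u} (q, w)\}$ and
$L^{\circledast}_{p,q} \triangleq \{u \in \Sigma^+ \mid \exists w \in \Gamma^*,\ (p, \bot) \vdash^{\circledast u} (q, w)\}$.
Finally, define the following subset of $Ld$:


$S \triangleq \bigcup_{p \in Q} L_{q_I,p|C} \times L^{\circledast}_{p,p|C} \ \cup\ L_{q_I,p|U_c} \times L^{\circledast}_{p,p|R}$
where $L_{|K}$ is defined to be $L \cap K$ to emphasize that $L$ is restricted to $K$.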


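Example 1. Consider the VPA $A$ and $B$ depicted in Fig. 1. We have $L^\omega(A) = R^\omega$,
$S = (W \times W \setminus \{\epsilon\}) \cup (R \setminus C \times R \setminus \{\epsilon\})$ and $L^\omega(B) = (W \setminus \{\epsilon\}\, r)^\omega$.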


Proposition 1. We have that $uv^\omega \in L^\omega(A) \Rightarrow \exists (u', v') \in S,\ u'v'^\omega = uv^\omega$.


By Theorem 2 and Proposition 1 the subset $S$ verifies:
$L^\omega(A) \subseteq M \iff \forall (u,v) \in S,\ uv^\omega \in M$. (1)


Next we reduce the inclusion check to a finite subset of S using quasiorders. 
Reduction to a Finite Basis. A quasiorder (qo) on a set $E$ is a reflexive and
transitive relation $\leq\ \subseteq E \times E$. Given two subsets $X, Y \subseteq E$ the set $Y$ is said to
be a basis for $X$ with respect to $\leq$ whenever $Y \subseteq X$ and $\forall x \in X, \exists y \in Y, y \leq x$.
A qo $\leq$ is a well-quasiorder (wqo) if every subset of $E$ admits a finite basis.

We obtain $S_{finite}$ as a finite basis for $S$ with respect to $\leq_1 \times \leq_2$ for a pair $\leq_1, \leq_2$
of wqos.⁵ To guarantee the direction $\Leftarrow$ in Eq. (†) we need the pair $\leq_1, \leq_2$ to be
$M$-preserving, a notion we introduce below.
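For a finite collection, a basis can be computed by keeping only elements that are not subsumed by another one, which is the core operation of antichain algorithms. The Python sketch below is illustrative: the function names are assumptions, and the scattered-subword order (a well-quasiorder on words by Higman's lemma) is used only as an example quasiorder, not as one of the quasiorders of this paper.

```python
def basis(X, leq):
    """A basis of the finite collection X w.r.t. the quasiorder `leq`:
    a sub-collection Y such that every x in X has some y in Y with leq(y, x)."""
    result = []
    for x in X:
        if any(leq(y, x) for y in result):
            continue  # x is subsumed by an element already kept
        # x is new: drop the kept elements that x subsumes, then keep x.
        result = [y for y in result if not leq(x, y)]
        result.append(x)
    return result

def is_subword(u, v):
    """True when u is obtained from v by deleting symbols."""
    remaining = iter(v)
    return all(a in remaining for a in u)
```

For example, `basis(["ab", "a", "ba", "aab", "b"], is_subword)` keeps only `"a"` and `"b"`: every other word contains one of them as a subword and is therefore subsumed.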

A pair $\leq_1, \leq_2$ of qos on $\Sigma^*$ is said to be $M$-preserving if for all $(u,v), (u',v') \in
Ld$ such that $(u,v), (u',v') \in C \times C$ or $(u,v), (u',v') \in U_c \times R$,


if $uv^\omega \in M$, $u \leq_1 u'$ and $v \leq_2 v'$ then $u'v'^\omega \in M$.


Intuitively, $M$-preservation guarantees that if the inclusion does not hold then
the finite basis $S_{finite}$ contains a counterexample.

Next, we fix a pair $\leq, \preceq$ of M-preserving wqos and show the existence of a
subset $S_{finite}$ such that Eq. (†) holds. Since $\leq \times \preceq$ is a wqo, there exist two finite
bases $S_1$ and $S_2$ for $S_{|C \times C}$ and $S_{|U_c \times R}$ respectively w.r.t. $\leq \times \preceq$. We define $S_{finite}$
to be the union of such sets $S_1$, $S_2$, viz. $S_{finite} \triangleq S_1 \cup S_2 \subseteq S$. We have that:
$\forall (u,v) \in S,\ uv^\omega \in M \implies \forall (u,v) \in S_{finite},\ uv^\omega \in M$. We now turn to the
converse implication. Assume that $\forall (u,v) \in S_{finite},\ uv^\omega \in M$. Let $(u,v) \in S$. If
$(u,v) \in S_{|C \times C}$ then there is $(u_0,v_0) \in S_1$ such that $(u_0,v_0) \leq \times \preceq (u,v)$. Since
$S_1 \subseteq S_{|C \times C} \subseteq C \times C$ we have that $(u_0,v_0), (u,v) \in C \times C$. Since $u_0 v_0^\omega \in M$ and
the pair $\leq, \preceq$ is M-preserving, we conclude that $uv^\omega \in M$. The case $(u,v) \in S_{|U_c \times R}$
proceeds analogously. It follows that $\forall (u,v) \in S,\ uv^\omega \in M \iff \forall (u,v) \in S_{finite},\ uv^\omega \in M$.
Hence, we derive Equation (†) using Equation (1).

In Section 4, we give a fixpoint characterization of S and in Section 5 we
show that under some monotonicity conditions on the wqos $\leq$ and $\preceq$ we can
effectively compute a finite basis for S. We then give two examples of monotonic
pairs of wqos in Section 6. In Section 7 we present our algorithm which, given
two VPA A and B, decides the inclusion problem $L^\omega(A) \subseteq L^\omega(B)$. Therein we
discuss the state-based algorithm and give an upper bound on its running time.
Finally in Section 8 we report on an empirical evaluation.


4 Fixpoint Characterization 


In this section we give a least fixpoint characterization of S for the VPA
$A = (Q, q_I, \Gamma, \delta, F)$. To this end we work with the complete lattice
$(\wp(\Sigma^*)^{n\cdot|Q|^2}, \subseteq \times \cdots \times \subseteq)$, where $n \in \{4,6\}$ and each Cartesian product
consists of $n\cdot|Q|^2$ factors.

For a function $f : E \to E$ on a quasiordered set $(E, \ltimes)$ and for all $n \in \mathbb{N}$,
we define the n-th iterate $f^n : E \to E$ of f inductively as follows: $f^0 \triangleq \lambda x.\,x$;
$f^{n+1} \triangleq f \circ f^n$. The denumerable sequence of Kleene iterates of f starting from
the bottom value $\bot \in E$ is given by $\{f^n(\bot)\}_{n \in \mathbb{N}}$. Recall that when $(E, \ltimes)$ is
a complete lattice and $f : E \to E$ is a monotone function (i.e. $d \ltimes d' \implies f(d) \ltimes f(d')$)
then by the Knaster-Tarski theorem, f has a least fixpoint lfp f
given by the supremum of the ascending⁶ sequence of Kleene iterates of f.

⁵ The qo $\leq \times \preceq$ is a wqo when both $\leq$ and $\preceq$ are wqos.

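Given an $n\cdot|Q|^2$-dimensional vector X and a $|Q|^2$-dimensional vector Y on
$\wp(\Sigma^*)$ we write $X_{i,p,q}$ for the (i,p,q)-component of X and $Y_{p,q}$ for the (p,q)-
component of Y. We define the following equations where $X, X' \in \wp(W)^{|Q|^2}$,
$Y, Y' \in \wp(C)^{|Q|^2}$, $Z, Z' \in \wp(R)^{|Q|^2}$, and $T \in \wp(U_c)^{|Q|^2}$:


$$W(X) \triangleq \Big( L_{p,q|\Sigma_i \cup \{\epsilon\}} \;\cup \bigcup_{\substack{(p,c,p',\gamma) \in \delta_c \\ (q',r,\gamma,q) \in \delta_r}} c\,X_{p',q'}\,r \;\cup \bigcup_{q' \in Q} X_{p,q'} X_{q',q} \Big)_{p,q \in Q}$$

$$C(X,Y) \triangleq \Big( L_{p,q|\Sigma_r} \cup X_{p,q} \cup \bigcup_{q' \in Q} Y_{p,q'} Y_{q',q} \Big)_{p,q \in Q}$$

$$R(X,Z) \triangleq \Big( L_{p,q|\Sigma_c} \cup X_{p,q} \cup \bigcup_{q' \in Q} Z_{p,q'} Z_{q',q} \Big)_{p,q \in Q}$$

$$U(Y,Z,T) \triangleq \Big( L_{p,q|\Sigma_c} \cup \bigcup_{p',q' \in Q} Y_{p,p'}\, T_{p',q'}\, Z_{q',q} \Big)_{p,q \in Q}$$

$$W^{\circledast}(X,X') \triangleq \Big( L^{\circledast}_{p,q|\Sigma_i} \;\cup \bigcup_{\substack{(p,c,p',\gamma) \in \delta_c \\ (q',r,\gamma,q) \in \delta_r \\ \{p,q\} \cap F \neq \emptyset}} c\,X_{p',q'}\,r \;\cup \bigcup_{\substack{(p,c,p',\gamma) \in \delta_c \\ (q',r,\gamma,q) \in \delta_r \\ \{p,q\} \cap F = \emptyset}} c\,X'_{p',q'}\,r \;\cup \bigcup_{q' \in Q} \big( X'_{p,q'} X_{q',q} \cup X_{p,q'} X'_{q',q} \big) \Big)_{p,q \in Q}$$

$$C^{\circledast}(X',Y,Y') \triangleq \Big( L^{\circledast}_{p,q|\Sigma_r} \cup X'_{p,q} \cup \bigcup_{q' \in Q} \big( Y'_{p,q'} Y_{q',q} \cup Y_{p,q'} Y'_{q',q} \big) \Big)_{p,q \in Q}$$

$$R^{\circledast}(X',Z,Z') \triangleq \Big( L^{\circledast}_{p,q|\Sigma_c} \cup X'_{p,q} \cup \bigcup_{q' \in Q} \big( Z'_{p,q'} Z_{q',q} \cup Z_{p,q'} Z'_{q',q} \big) \Big)_{p,q \in Q}$$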


The equations W, C, R and U are used to obtain the set of words in W, C, R
and $U_c$, respectively, that connect two configurations of A. The equations $W^{\circledast}$,
$C^{\circledast}$ and $R^{\circledast}$ refine those of W, C and R by filtering out words not visiting final
states. In turn we define the functions $f_A$ and $r_A$ used to obtain the prefixes u
and the periods v respectively for the decompositions $(u,v) \in S$. Define


$$f_A : \wp(\Sigma^*)^{4\cdot|Q|^2} \to \wp(\Sigma^*)^{4\cdot|Q|^2}$$
$$(X,Y,Z,T) \mapsto \big(W(X),\ C(X,Y),\ R(X,Z),\ U(Y,Z,T)\big)$$

for the prefixes, and for the periods define

$$r_A : \wp(\Sigma^*)^{6\cdot|Q|^2} \to \wp(\Sigma^*)^{6\cdot|Q|^2}$$
$$(X,Y,Z,X',Y',Z') \mapsto \big(W(X),\ C(X,Y),\ R(X,Z),\ W^{\circledast}(X,X'),\ C^{\circledast}(X',Y,Y'),\ R^{\circledast}(X',Z,Z')\big).$$


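The function $f_A$ (resp. $r_A$) is monotone and the supremum of the ascending
sequence of its Kleene iterates starting at the bottom value $\vec{\emptyset} \triangleq (\emptyset, \ldots, \emptyset)$ of
dimension $4\cdot|Q|^2$ (resp. $6\cdot|Q|^2$) is the vector $(A_{|W}, A_{|C}, A_{|R}, A_{|U_c})$
(resp. $(A_{|W}, A_{|C}, A_{|R}, A^{\circledast}_{|W}, A^{\circledast}_{|C}, A^{\circledast}_{|R})$) where $A_{|J} \triangleq (L_{p,q|J})_{p,q \in Q}$ and
$A^{\circledast}_{|J} \triangleq (L^{\circledast}_{p,q|J})_{p,q \in Q}$ for $J \in \{W, C, R, U_c\}$.
Therefore, by the Knaster-Tarski theorem we obtain the following proposition.


Proposition 2. lfp $f_A = (A_{|W}, A_{|C}, A_{|R}, A_{|U_c})$ and
lfp $r_A = (A_{|W}, A_{|C}, A_{|R}, A^{\circledast}_{|W}, A^{\circledast}_{|C}, A^{\circledast}_{|R})$.


Finally, by Proposition 2, we obtain the desired fixpoint characterization of S:


$$S = \bigcup_{p \in Q} \Big( (\mathrm{lfp}\, f_A)_{2,q_I,p} \times (\mathrm{lfp}\, r_A)_{5,p,p} \;\cup\; (\mathrm{lfp}\, f_A)_{4,q_I,p} \times (\mathrm{lfp}\, r_A)_{6,p,p} \Big). \qquad (2)$$

⁶ A sequence $\{s_n\}_{n \in \mathbb{N}} \in E^{\mathbb{N}}$ on an ordered set $(E, \ltimes)$ is ascending if for every $n \in \mathbb{N}$
we have $s_n \ltimes s_{n+1}$.
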
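Example 2. We derive from the VPA A depicted in Fig. 1 the following functions


$$W(X) \triangleq \{\epsilon\} \cup cXr \cup XX, \qquad C(X,Y) \triangleq X \cup YY,$$
$$R(X,Z) \triangleq \{c\} \cup X \cup ZZ, \qquad U(Y,Z,T) \triangleq \{c\} \cup YTZ.$$


Hence, we obtain the function


$$f_A : \wp(\Sigma^*)^4 \to \wp(\Sigma^*)^4$$
$$(X,Y,Z,T) \mapsto \big(W(X),\ C(X,Y),\ R(X,Z),\ U(Y,Z,T)\big).$$


The first three iterates of the least fixpoint computation of lfp $f_A$ are given by


$$f_A^1(\vec{\emptyset}) = (\{\epsilon\},\ \emptyset,\ \{c\},\ \{c\}),$$
$$f_A^2(\vec{\emptyset}) = (\{\epsilon, cr\},\ \{\epsilon\},\ \{\epsilon, c, c^2\},\ \{c\}),$$
$$f_A^3(\vec{\emptyset}) = (\{\epsilon, cr, c^2r^2, (cr)^2\},\ \{\epsilon, cr\},\ \{\epsilon, c, c^2, c^3, c^4, cr\},\ \{c, c^2, c^3\}),$$
$$\vdots$$
$$\mathrm{lfp}\, f_A = (W,\ W,\ R,\ R\setminus C).$$


Since the unique state of A is a final state we have that $L_{q_I,q_I} = L^{\circledast}_{q_I,q_I}$.
Consequently, the function $f_A$ suffices to describe both the set of prefixes and the set
of periods of S given by $\big((\mathrm{lfp}\, f_A)_2 \times (\mathrm{lfp}\, f_A)_2\setminus\{\epsilon\}\big) \cup \big((\mathrm{lfp}\, f_A)_4 \times (\mathrm{lfp}\, f_A)_3\setminus\{\epsilon\}\big)$.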
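
The iterates of Example 2 can be replayed mechanically. The Python sketch below encodes the four equations of the example for the one-state automaton, representing each component as a finite set of strings over the letters c and r. The helper names `concat` and `kleene_iterates` are hypothetical and introduced only for this illustration.

```python
from itertools import product


def concat(*langs):
    """Concatenation of finite languages given as sets of strings."""
    return {"".join(parts) for parts in product(*langs)}


def f_A(vec):
    """One application of f_A from Example 2 on a vector (X, Y, Z, T)."""
    X, Y, Z, T = vec
    W_new = {""} | concat({"c"}, X, {"r"}) | concat(X, X)  # W(X)
    C_new = X | concat(Y, Y)                               # C(X, Y)
    R_new = {"c"} | X | concat(Z, Z)                       # R(X, Z)
    U_new = {"c"} | concat(Y, T, Z)                        # U(Y, Z, T)
    return (W_new, C_new, R_new, U_new)


def kleene_iterates(f, bottom, n):
    """Return [bottom, f(bottom), ..., f^n(bottom)]."""
    out = [bottom]
    for _ in range(n):
        out.append(f(out[-1]))
    return out


bottom = (set(), set(), set(), set())
its = kleene_iterates(f_A, bottom, 3)
for i, vec in enumerate(its[1:], start=1):
    print(i, [sorted(component, key=lambda w: (len(w), w)) for component in vec])
```

The empty string plays the role of ε. Each printed vector lists the W, C, R and U components in that order, and each iterate contains the previous one, as expected from an ascending sequence.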


Each (i,p,q)-component of the Kleene iterates of $f_A$ and $r_A$ keeps a finite set
of words. However, if the language L(A) is infinite, the fixpoint computations
of lfp $f_A$ and lfp $r_A$ do not terminate in a finite number of steps. Nevertheless,
under some monotonicity assumptions on our wqos we show in the following
section that we can compute a finite basis for S w.r.t. $\leq \times \preceq$ as a terminating
fixpoint computation.


5 Monotonicity Requirements 


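In order to detect finite bases among the Kleene iterates of the functions defined
in the previous section we replace the set inclusion on $\wp(\Sigma^*)$, used so far, with
the qo $\sqsubseteq_{\ltimes} \subseteq \wp(\Sigma^*) \times \wp(\Sigma^*)$ defined by $X \sqsubseteq_{\ltimes} Y \iff \forall x \in X, \exists y \in Y, y \ltimes x$.
The qo $\sqsubseteq_{\ltimes}$ leverages the notion of basis: given $X \in \wp(\Sigma^*)$ a subset $Y \subseteq X$ is a
basis for X with respect to $\ltimes$ whenever $X \sqsubseteq_{\ltimes} Y$.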
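
As a small illustration of this lifted ordering, the following Python sketch tests it on finite sets. The length comparison used as quasiorder is an assumption chosen only because it is easy to read (it is a wqo on words but is unrelated to the paper's orderings), and the names `below` and `is_basis` are hypothetical.

```python
def below(X, Y, leq):
    """X is below Y in the lifted ordering: every x in X has some y in Y with leq(y, x)."""
    return all(any(leq(y, x) for y in Y) for x in X)


def is_basis(Y, X, leq):
    """Y is a basis for X: Y is a subset of X and X is below Y."""
    return Y <= X and below(X, Y, leq)


def shorter_or_equal(u, v):
    # Stand-in quasiorder for the example: compare words by length only.
    return len(u) <= len(v)


X = {"a", "ab", "abc", "ba"}
print(is_basis({"a"}, X, shorter_or_equal))   # True
print(is_basis({"ab"}, X, shorter_or_equal))  # False: nothing in {"ab"} is below "a"
```

Note the direction of the test: the basis sits on the right-hand side of the lifted ordering, and its elements must lie below the elements they represent.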


In the following we lift the notion of basis to n-dimensional vectors componentwise
and work with the quasiordered sets $(\wp(\Sigma^*)^{n\cdot|Q|^2}, \sqsubseteq_{\ltimes}^{n\cdot|Q|^2})$, where $n \in \{4,6\}$
and the ordering $\sqsubseteq_{\ltimes}^{n\cdot|Q|^2}$ is given by the product $\sqsubseteq_{\ltimes} \times \cdots \times \sqsubseteq_{\ltimes}$ of $n\cdot|Q|^2$
factors. Given a pair $\leq, \preceq$ of wqos, the orderings $\sqsubseteq_{\leq}^{4\cdot|Q|^2}$ and $\sqsubseteq_{\preceq}^{6\cdot|Q|^2}$ are used to
compare the Kleene iterates of the functions $f_A$ and $r_A$ respectively. For them
to be apt to detect finite bases for the least fixpoints of these functions the qos
$\leq$ and $\preceq$ need to verify some monotonicity conditions.

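We introduce the monotonicity conditions W, C, R, $C^{\circledast}$, $R^{\circledast}$ and U on a qo
$\ltimes \subseteq \Sigma^* \times \Sigma^*$ as follows: for all $u, u' \in \Sigma^*$ such that $u \ltimes u'$


(W) if $u, u' \in W$ and $c \in \Sigma_c$, $r \in \Sigma_r$, then $cur \ltimes cu'r$,
(C) if $u, u' \in C$ and $s \in C$, $t \in \Sigma^*$ then $sut \ltimes su't$,
(R) if $u, u' \in R$ and $s \in \Sigma^*$, $t \in R$ then $sut \ltimes su't$,
(U) if $u, u' \in U_c$ and $s \in C$, $t \in R$ then $sut \ltimes su't$,
($C^{\circledast}$) if $u, u' \in C$ and $s \in C$, $t \in C$ then $sut \ltimes su't$,
($R^{\circledast}$) if $u, u' \in R$ and $s \in R$, $t \in R$ then $sut \ltimes su't$.


A pair of qos $\leq, \preceq$ is monotonic if $\leq$ verifies W, C, R, U and $\preceq$ verifies W, $C^{\circledast}$, $R^{\circledast}$.

Proposition 3. Let $\leq, \preceq$ be a pair of wqos. There is a positive integer n such
that $f_A^{n+1}(\vec{\emptyset}) \sqsubseteq_{\leq}^{4\cdot|Q|^2} f_A^{n}(\vec{\emptyset})$ (resp. $r_A^{n+1}(\vec{\emptyset}) \sqsubseteq_{\preceq}^{6\cdot|Q|^2} r_A^{n}(\vec{\emptyset})$); and, if the pair of
wqos is monotonic then lfp $f_A \sqsubseteq_{\leq}^{4\cdot|Q|^2} f_A^{n}(\vec{\emptyset})$ (resp. lfp $r_A \sqsubseteq_{\preceq}^{6\cdot|Q|^2} r_A^{n}(\vec{\emptyset})$).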


Each Kleene iterate of $f_A$ and $r_A$ is computable and given a decidable qo $\ltimes$
on $\Sigma^*$ and two finite sets $X, Y \subseteq \Sigma^*$ it is decidable whether $X \sqsubseteq_{\ltimes} Y$ holds.
Thus, given a monotonic pair $\leq, \preceq$ of decidable wqos, by Proposition 3, we can
compute a finite basis for lfp $f_A$ w.r.t. $\leq$ and a finite basis for lfp $r_A$ w.r.t. $\preceq$.
Hence, by Equation (2) we can compute a finite basis for S w.r.t. $\leq \times \preceq$.


6 Quasiorders for w-VPL 


In the following we present two families of qos to solve the inclusion problem
$L^\omega(A) \subseteq M$: the state-based qos, which are derived from a VPA-representation of
M and compare words according to the set of configurations each word connects
in the VPA, and the syntactic qos, which rely on the syntactic structure of M.
We say that a pair of qos is M-suitable if it is an M-preserving and monotonic
pair of decidable wqos. Intuitively, if a pair of qos is M-suitable then it can be
used in our algorithm to decide the inclusion $L^\omega(A) \subseteq M$.


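State-based Quasiorders. Given a VPA $B = (\hat{Q}, \hat{q}_I, \hat{\Gamma}, \hat{\delta}, \hat{F})$ we associate with
each word $u \in \Sigma^*$ its context $\mathrm{ctx}^B[u]$ and final context $\mathrm{ctx}^B_{\circledast}[u]$ in B as follows:


$$\mathrm{ctx}^B[u] \triangleq \{(p,q) \in \hat{Q}^2 \mid \exists w \in \hat{\Gamma}^*, (p,\bot) \vdash^{*u} (q,w)\},$$
$$\mathrm{ctx}^B_{\circledast}[u] \triangleq \{(p,q) \in \hat{Q}^2 \mid \exists w \in \hat{\Gamma}^*, (p,\bot) \vdash^{\circledast u} (q,w)\}.$$


Hence we define the following qos on words in $\Sigma^*$:


$$u \leq^B u' \iff \mathrm{ctx}^B[u] \subseteq \mathrm{ctx}^B[u'], \qquad u \preceq^B u' \iff u \leq^B u' \wedge \mathrm{ctx}^B_{\circledast}[u] \subseteq \mathrm{ctx}^B_{\circledast}[u'].$$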


Proposition 4. Let B be a VPA. The pair $\leq^B, \preceq^B$ is $L^\omega(B)$-suitable.


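Example 3. Consider the pair of qos $\leq^B, \preceq^B$ derived as explained above from
B (Fig. 1) and the set $S = (W \times W\setminus\{\epsilon\}) \cup (R\setminus C \times R\setminus\{\epsilon\})$ from Example 1. We
have that $\mathrm{ctx}^B[\epsilon] = \{(p,p), (q,q)\}$, $\mathrm{ctx}^B_{\circledast}[\epsilon] = \{(p,p)\}$, $\mathrm{ctx}^B[u] = \{(p,q), (q,q)\}$
and $\mathrm{ctx}^B_{\circledast}[u] = \{(p,q)\}$ for every $u \in R\setminus\{\epsilon\}$. We have that $\{c\}$ is a basis for $R\setminus\{\epsilon\}$
w.r.t. $\preceq^B$ since $c \preceq^B u$ for every $u \in R\setminus\{\epsilon\}$. Since $R\setminus C \subseteq R\setminus\{\epsilon\}$ and $\{c\} \subseteq R\setminus C$
we deduce that $\{c\}$ is also a basis for $R\setminus C$ w.r.t. $\leq^B$. Similarly we deduce that
$\{\epsilon, cr\}$ is a basis for W w.r.t. $\leq^B$ and that $\{cr\}$ is a basis for $W\setminus\{\epsilon\}$ w.r.t. $\preceq^B$. Hence,
$(\{\epsilon, cr\} \times \{cr\}) \cup (\{c\} \times \{c\})$ is a basis for S w.r.t. $\leq^B \times \preceq^B$.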
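
The contexts quoted in Example 3 can be recomputed by simulating the automaton from the empty stack. The Python sketch below does so for a hypothetical encoding of B. The assumptions are: states p and q with p the only final state, the three call and return transitions quoted in Example 5, a single stack symbol named "g" here, and no return transition taken on the empty stack (such transitions, if B has any, play no role for words without pending returns). The figure itself is not reproduced in this text, so this encoding is a reconstruction, not the authors' definition.

```python
STATES = ("p", "q")
FINAL = {"p"}
CALLS = {("p", "c", "q", "g"), ("q", "c", "q", "g")}  # (state, letter, target, pushed symbol)
RETURNS = {("q", "r", "g", "q")}                      # (state, letter, popped symbol, target)


def contexts(u):
    """Return (ctx, ctx_final) of the word u, both as sets of state pairs."""
    ctx, ctx_final = set(), set()
    for start in STATES:
        # A configuration is (state, stack, flag: a final state was visited so far).
        configs = {(start, (), start in FINAL)}
        for a in u:
            nxt = set()
            for (s, stack, seen) in configs:
                for (s1, letter, s2, sym) in CALLS:
                    if s1 == s and letter == a:
                        nxt.add((s2, stack + (sym,), seen or s2 in FINAL))
                for (s1, letter, sym, s2) in RETURNS:
                    if s1 == s and letter == a and stack and stack[-1] == sym:
                        nxt.add((s2, stack[:-1], seen or s2 in FINAL))
            configs = nxt
        for (s, _stack, seen) in configs:
            ctx.add((start, s))
            if seen:
                ctx_final.add((start, s))
    return ctx, ctx_final


def leq_B(u, v):
    """State-based qo for prefixes: inclusion of contexts."""
    return contexts(u)[0] <= contexts(v)[0]


def preceq_B(u, v):
    """State-based qo for periods: inclusion of contexts and of final contexts."""
    cu, fu = contexts(u)
    cv, fv = contexts(v)
    return cu <= cv and fu <= fv


print(contexts(""))    # contexts of the empty word
print(contexts("cr"))  # contexts of a non-empty word without pending returns
```

Under these assumptions the two printed results agree with the values stated in Example 3, and `preceq_B("c", u)` holds for sample words u without pending returns such as "cc", "cr" or "crc".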


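Syntactic Quasiorders. Given an ω-VPL M we associate with each word $u \in \Sigma^*$
its context $\mathrm{ctx}^M[u]$ and final context $\mathrm{ctx}^M_{\circledast}[u]$ in M as follows:


$$\mathrm{ctx}^M[u] \triangleq \{(s,\xi) \in \Sigma^* \times \Sigma^\omega \mid su\xi \in M\},$$
$$\mathrm{ctx}^M_{\circledast}[u] \triangleq \{(s,t) \in \Sigma^* \times \Sigma^* \mid s(ut)^\omega \in M\}.$$

At first glance, we are tempted to define the syntactic qos from $\mathrm{ctx}^M$ and $\mathrm{ctx}^M_{\circledast}$
in the same way we defined the state-based qos from the contexts and final
contexts relative to a VPA. Although this definition provides a pair of
M-preserving qos, it does not guarantee that the pair is M-suitable. To overcome
this, we impose the respect of the partition $P \triangleq \{W, C\setminus W, R\setminus W, U_c\setminus R\}$ of $\Sigma^*$,
meaning that two words compare only if they belong to the same subset of P.
Additionally, given $J \in P$ we compare two words of J by considering a restriction
of their context and final context in M which depends on J. More precisely, we
define the qo $\leq^M$ on $\Sigma^*$ as the union $\bigcup_{J \in P} \leq^M_J$ where for every $J \in P$, the qo
$\leq^M_J \subseteq J \times J$ is defined by


$$u \leq^M_{W} u' \iff \mathrm{ctx}^M[u] \subseteq \mathrm{ctx}^M[u'],$$
$$u \leq^M_{C\setminus W} u' \iff \mathrm{ctx}^M[u]_{|C \times \Sigma^\omega} \subseteq \mathrm{ctx}^M[u']_{|C \times \Sigma^\omega},$$
$$u \leq^M_{R\setminus W} u' \iff \mathrm{ctx}^M[u]_{|\Sigma^* \times R^\omega} \subseteq \mathrm{ctx}^M[u']_{|\Sigma^* \times R^\omega},$$
$$u \leq^M_{U_c\setminus R} u' \iff \mathrm{ctx}^M[u]_{|C \times R^\omega} \subseteq \mathrm{ctx}^M[u']_{|C \times R^\omega}.$$


Similarly, we define the qo $\preceq^M \triangleq \bigcup_{J \in P} \preceq^M_J$ on $\Sigma^*$ where for every $J \in P$,
$\preceq^M_J \subseteq J \times J$ is the qo defined by


$$u \preceq^M_{W} u' \iff u \leq^M_{W} u' \wedge \mathrm{ctx}^M_{\circledast}[u] \subseteq \mathrm{ctx}^M_{\circledast}[u'],$$
$$u \preceq^M_{C\setminus W} u' \iff u \leq^M_{C\setminus W} u' \wedge \big(\mathrm{ctx}^M_{\circledast}[u]_{|C \times C} \subseteq \mathrm{ctx}^M_{\circledast}[u']_{|C \times C}\big),$$
$$u \preceq^M_{R\setminus W} u' \iff u \leq^M_{R\setminus W} u' \wedge \big(\mathrm{ctx}^M_{\circledast}[u]_{|\Sigma^* \times R} \subseteq \mathrm{ctx}^M_{\circledast}[u']_{|\Sigma^* \times R}\big),$$
$$u \preceq^M_{U_c\setminus R} u' \iff u, u' \in U_c\setminus R.$$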


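Proposition 5. Let B be a VPA. The pair $\leq^{L^\omega(B)}, \preceq^{L^\omega(B)}$ is $L^\omega(B)$-suitable.

Proof (sketch). First we show that the pair $\leq^M, \preceq^M$ is M-preserving, where
$M \triangleq L^\omega(B)$. Let $(u,v), (u',v') \in C \times C$ (resp. $U_c \times R$) such that $u \leq^M u'$, $v \preceq^M v'$ and
$uv^\omega \in M$. From $u \leq^M u'$ and $uv^\omega \in M$ we deduce that
$(\epsilon, v^\omega) \in \mathrm{ctx}^M[u]_{|C \times \Sigma^\omega} \subseteq \mathrm{ctx}^M[u']_{|C \times \Sigma^\omega}$
(resp. $(\epsilon, v^\omega) \in \mathrm{ctx}^M[u]_{|\Sigma^* \times R^\omega} \subseteq \mathrm{ctx}^M[u']_{|\Sigma^* \times R^\omega}$). Thus, $u'v^\omega \in M$. From
$v \preceq^M v'$ and $u'v^\omega \in M$ we deduce that $(u', \epsilon) \in \mathrm{ctx}^M_{\circledast}[v]_{|C \times C} \subseteq \mathrm{ctx}^M_{\circledast}[v']_{|C \times C}$
(resp. $(u', \epsilon) \in \mathrm{ctx}^M_{\circledast}[v]_{|\Sigma^* \times R} \subseteq \mathrm{ctx}^M_{\circledast}[v']_{|\Sigma^* \times R}$). Thus, $u'v'^\omega \in M$.

We now show that the qo $\leq^M$ satisfies the monotonicity conditions C and R.
Let $u \leq^M u'$ such that $u, u' \in C$ (resp. $u, u' \in R$). Let $s \in C$ and $t \in \Sigma^*$ (resp. $s \in \Sigma^*$
and $t \in R$). If $u, u' \in W$ then it is easy to check that $sut \leq^M su't$. Otherwise
$u, u' \in C\setminus W$ (resp. $u, u' \in R\setminus W$) and we distinguish two cases: if $t \in C$ (resp. $s \in R$)
then $sut, su't \in C\setminus W$ (resp. $sut, su't \in R\setminus W$). We show that $sut \leq^M_{C\setminus W} su't$ (resp.
$sut \leq^M_{R\setminus W} su't$). Let $(s', \xi) \in \mathrm{ctx}^M[sut]_{|C \times \Sigma^\omega}$ (resp. $(s', \xi) \in \mathrm{ctx}^M[sut]_{|\Sigma^* \times R^\omega}$).
Since $s's \in C$ (resp. $t\xi \in R^\omega$), we deduce from $u \leq^M_{C\setminus W} u'$ (resp. $u \leq^M_{R\setminus W} u'$)
that $(s', \xi) \in \mathrm{ctx}^M[su't]_{|C \times \Sigma^\omega}$ (resp. $(s', \xi) \in \mathrm{ctx}^M[su't]_{|\Sigma^* \times R^\omega}$). If $t \in U_c$ (resp.
$s \in \Sigma^*\setminus R$) then $sut, su't \in U_c\setminus R$ and similarly we can show that $sut \leq^M_{U_c\setminus R} su't$.
The proof that $\leq^M$ and $\preceq^M$ are wqos follows from [9, Prop 1.2] by observing
that for every J in the partition P of $\Sigma^*$ we have $\leq^B_{|J \times J} \subseteq \leq^M_J$ and $\preceq^B_{|J \times J} \subseteq \preceq^M_J$,
where $\leq^B$ and $\preceq^B$ are the state-based qos previously defined.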


Deciding the syntactic qos can be easily shown to be as hard as the inclusion
problem between ω-VPL generated by VPA. Nevertheless, the syntactic qos act
as a gold standard for quasiorders in the sense formalized in the next proposition.


Proposition 6. Let $M \subseteq \Sigma^\omega$ be an ω-VPL and $\leq, \preceq$ be an M-suitable pair of
qos such that $\preceq \subseteq \leq$. For every $J \in P$ we have $\leq_{|J \times J} \subseteq \leq^M_J$ and $\preceq_{|J \times J} \subseteq \preceq^M_J$.


By Propositions 5 and 6 the pair $\leq^{L^\omega(B)}, \preceq^{L^\omega(B)}$ is the greatest (w.r.t. $\subseteq \times \subseteq$)
among the $L^\omega(B)$-suitable pairs $\leq, \preceq$ of qos that respect the partition P and
that verify $\preceq \subseteq \leq$.


7 Algorithm 


We are now in a position to present our algorithm which, given two VPA
$A = (Q, q_I, \Gamma, \delta, F)$ and $B = (\hat{Q}, \hat{q}_I, \hat{\Gamma}, \hat{\delta}, \hat{F})$ and a pair of $L^\omega(B)$-suitable qos, decides
the inclusion problem $L^\omega(A) \subseteq L^\omega(B)$.

Algorithm 1 computes a finite basis for S w.r.t. $\leq \times \preceq$ (lines 1-2) and
afterwards checks membership in $L^\omega(B)$ on every ultimately periodic word $uv^\omega$
stemming from this finite basis (lines 3-7).


Theorem 3. Given the required inputs, Algorithm 1 decides the inclusion
problem $L^\omega(A) \subseteq L^\omega(B)$.


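Algorithm 1: Algorithm for deciding $L^\omega(A) \subseteq L^\omega(B)$
  Data: VPA $A = (Q, q_I, \Gamma, \delta, F)$ and $B = (\hat{Q}, \hat{q}_I, \hat{\Gamma}, \hat{\delta}, \hat{F})$.
  Data: $L^\omega(B)$-suitable pair $\leq, \preceq$.
  Data: Procedure deciding $uv^\omega \in L^\omega(B)$ given $(u,v)$.
  1  Compute $f_A^{m}(\vec{\emptyset})$ with least m s.t. $f_A^{m+1}(\vec{\emptyset}) \sqsubseteq_{\leq}^{4\cdot|Q|^2} f_A^{m}(\vec{\emptyset})$;
  2  Compute $r_A^{m'}(\vec{\emptyset})$ with least m' s.t. $r_A^{m'+1}(\vec{\emptyset}) \sqsubseteq_{\preceq}^{6\cdot|Q|^2} r_A^{m'}(\vec{\emptyset})$;
  3  foreach $p \in Q$ do
  4      foreach $u \in (f_A^{m}(\vec{\emptyset}))_{2,q_I,p}$, $v \in (r_A^{m'}(\vec{\emptyset}))_{5,p,p}$ do
  5          if $uv^\omega \notin L^\omega(B)$ then return false;
  6      foreach $u \in (f_A^{m}(\vec{\emptyset}))_{4,q_I,p}$, $v \in (r_A^{m'}(\vec{\emptyset}))_{6,p,p}$ do
  7          if $uv^\omega \notin L^\omega(B)$ then return false;
  8  return true;


Proof. As established by Proposition 3, given a monotonic pair $\leq, \preceq$ of decidable
wqos, Algorithm 1 computes in line 1 (resp. line 2) a finite basis $f_A^{m}(\vec{\emptyset})$ (resp.
$r_A^{m'}(\vec{\emptyset})$) for lfp $f_A$ (resp. lfp $r_A$) w.r.t. $\leq$ (resp. $\preceq$). Next define:


$$S^{m,m'} \triangleq \bigcup_{p \in Q} \Big( (f_A^{m}(\vec{\emptyset}))_{2,q_I,p} \times (r_A^{m'}(\vec{\emptyset}))_{5,p,p} \;\cup\; (f_A^{m}(\vec{\emptyset}))_{4,q_I,p} \times (r_A^{m'}(\vec{\emptyset}))_{6,p,p} \Big).$$


Using Equation (2) we deduce that $S^{m,m'}$ is a finite basis for S w.r.t. $\leq \times \preceq$.
Since the pair $\leq, \preceq$ is $L^\omega(B)$-preserving, by Section 3, we deduce that


$$L^\omega(A) \subseteq L^\omega(B) \iff \forall (u,v) \in S^{m,m'},\ uv^\omega \in L^\omega(B).$$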


We remark that Algorithm 1 can be easily adapted to decide the inclusion prob- 
lem between visibly pushdown languages of finite words. The adaptation to the 
finite words case omits the fixpoint computation of line 2 and iterates over the 
components (i, qr, p) where i € {2,3,4} and where p € F is a final state. 


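Example 4. Consider the iterates of the function $f_A$ from Example 2. One can
check that $f_A^4(\vec{\emptyset}) \sqsubseteq^{4}_{\preceq^B} f_A^3(\vec{\emptyset})$ (thus also $f_A^4(\vec{\emptyset}) \sqsubseteq^{4}_{\leq^B} f_A^3(\vec{\emptyset})$ since $\preceq^B \subseteq \leq^B$).
Thus, we check whether the inclusion $L^\omega(A) \subseteq L^\omega(B)$ holds on the finite set
$(\{\epsilon, cr\} \times \{cr\}) \cup (\{c, c^2, c^3\} \times \{cr, c, c^2, c^3, c^4\})$ and find the counterexample


$$\epsilon (cr)^\omega \in L^\omega(A) \setminus L^\omega(B).$$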
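
The convergence claimed in Example 4 can be checked with a few lines of Python. The sketch below is a simplified stand-in for line 1 of Algorithm 1 on the running example, under these assumptions: $f_A$ is encoded as in Example 2, B has states p and q with the call and return transitions quoted in Example 5, a single stack symbol (so the stack is modelled by its height), and no return transition on the empty stack. All function names are hypothetical.

```python
from itertools import product


def concat(*langs):
    """Concatenation of finite languages given as sets of strings."""
    return {"".join(parts) for parts in product(*langs)}


def f_A(vec):
    X, Y, Z, T = vec
    return ({""} | concat({"c"}, X, {"r"}) | concat(X, X),  # W(X)
            X | concat(Y, Y),                                # C(X, Y)
            {"c"} | X | concat(Z, Z),                        # R(X, Z)
            {"c"} | concat(Y, T, Z))                         # U(Y, Z, T)


def ctx(u):
    """Context of u in the assumed B, simulated from an empty stack."""
    pairs = set()
    for start in ("p", "q"):
        configs = {(start, 0)}  # (state, stack height)
        for a in u:
            nxt = set()
            for (s, h) in configs:
                if a == "c":
                    nxt.add(("q", h + 1))   # both call transitions lead to q
                elif a == "r" and s == "q" and h > 0:
                    nxt.add(("q", h - 1))   # the only return transition
            configs = nxt
        pairs |= {(start, s) for (s, _h) in configs}
    return pairs


def below_vec(V1, V2):
    """Componentwise lifted ordering induced by inclusion of contexts."""
    return all(all(any(ctx(y) <= ctx(x) for y in Y) for x in X)
               for X, Y in zip(V1, V2))


def least_m(bottom, limit=10):
    """Least m such that f_A^(m+1)(bottom) is below f_A^m(bottom), with that iterate."""
    current = bottom
    for m in range(limit):
        nxt = f_A(current)
        if below_vec(nxt, current):
            return m, current
        current = nxt
    raise RuntimeError("no convergence within the iteration limit")


m, iterate = least_m((set(), set(), set(), set()))
print(m)  # 3
```

The check fails at the second iterate because cr appears in the C-component while only ε is available to dominate it, and the contexts of ε and cr are incomparable. From the third iterate on, every new word has the same context as a word already present.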


Antichains Everywhere. We show next that Algorithm 1 remains correct if,
in the sequence of Kleene iterates of $f_A$ or $r_A$, for each application of $f_A$ or $r_A$
we first select a finite basis for their arguments instead (using $\leq^{4\cdot|Q|^2}$ for $f_A$ and
$\preceq^{6\cdot|Q|^2}$ for $r_A$).

Proposition 7. Let $\ltimes$ be a qo that verifies the monotonicity conditions W, C,
R, U. If B is a basis for $(X,Y,Z,T) \in \wp(W)^{|Q|^2} \times \wp(C)^{|Q|^2} \times \wp(R)^{|Q|^2} \times \wp(U_c)^{|Q|^2}$
w.r.t. $\ltimes^{4\cdot|Q|^2}$ then $f_A(B)$ is a basis for $f_A(X,Y,Z,T)$ w.r.t. $\ltimes^{4\cdot|Q|^2}$. The analogous
result holds for $r_A$ when $\ltimes$ satisfies the monotonicity conditions W, $C^{\circledast}$, $R^{\circledast}$.

Since every Kleene iterate of $f_A$ belongs to $\wp(W)^{|Q|^2} \times \wp(C)^{|Q|^2} \times \wp(R)^{|Q|^2} \times \wp(U_c)^{|Q|^2}$,
given a basis B for $f_A^{n}(\vec{\emptyset})$ w.r.t. $\leq^{4\cdot|Q|^2}$, by Proposition 7, $f_A(B)$ is a
basis for $f_A^{n+1}(\vec{\emptyset})$ w.r.t. $\leq^{4\cdot|Q|^2}$. Hence, at each iteration we can select, for each
(i,p,q)-component, a basis w.r.t. $\leq$ and then apply $f_A$. In particular, we can
keep antichains for each (i,p,q)-component, that is, finite bases of incomparable
words. The analogous result holds for the Kleene iterates of $r_A$.


7.1 State-based Algorithm 


Next we consider Algorithm 1 instantiated with the pair of state-based qos (§ 6). 


Data Structures. Comparing two words given a state-based qo requires
computing the corresponding sets of contexts in B. Instead of computing contexts
every time we need to compare two words we cache the context information
along with each word for faster retrieval. More precisely, we cache $\mathrm{ctx}^B[u]$ along
with u when u is a prefix and we cache $(\mathrm{ctx}^B[v], \mathrm{ctx}^B_{\circledast}[v])$ along with v when v is
a period. Next we go even further and explain that new context information can
be computed inductively from already computed context information. Assume
we are computing a new word during the fixpoint computation, for instance the
word cur that is obtained by flanking u with c and r. We will show that the context
information of cur can be computed directly from that of u, c and r instead of
computing it for cur from scratch.


Fixpoint Computation. Given an input vector the functions $f_A$ and $r_A$ add
new words of type uu' and cur to its components, where c and r are fixed
letters, and u, u' are words already present in some components of the vector.
The following equalities show that we can inductively compute the contexts and
final contexts in B of newly added words in these functions: for every $u, u' \in C \cup R$,
$c \in \Sigma_c$, $r \in \Sigma_r$, we have


$$\mathrm{ctx}^B[uu'] = \{(p,q) \in \hat{Q}^2 \mid \exists p_1 \in \hat{Q}, (p,p_1) \in \mathrm{ctx}^B[u], (p_1,q) \in \mathrm{ctx}^B[u']\},$$
$$\mathrm{ctx}^B[cur] = \{(p,q) \in \hat{Q}^2 \mid \exists (p',q') \in \mathrm{ctx}^B[u], \exists \gamma \in \hat{\Gamma}, (p,c,p',\gamma) \in \hat{\delta}_c, (q',r,\gamma,q) \in \hat{\delta}_r\}.$$


The definitions for $\mathrm{ctx}^B_{\circledast}[uu']$ and $\mathrm{ctx}^B_{\circledast}[cur]$ are left as an exercise to the reader.


Example 5. Using the above definition it is routine to check that $\mathrm{ctx}^B[cr] = \{(p,q), (q,q)\}$
because $cr = c\epsilon r$, $\mathrm{ctx}^B[\epsilon] = \{(p,p), (q,q)\}$ (Example 3) and


$(p,c,q,A), (q,c,q,A) \in \hat{\delta}_c$, $(q,r,A,q) \in \hat{\delta}_r$.
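
The two equalities for contexts translate directly into operations on sets of state pairs. The Python sketch below is a minimal illustration on the transitions quoted in Example 5; the function names `compose` and `wrap` are hypothetical, and the final-context variants are not covered.

```python
def compose(x, y):
    """ctx[uu'] from x = ctx[u] and y = ctx[u']: relational composition."""
    return {(p, q) for (p, p1) in x for (p2, q) in y if p1 == p2}


def wrap(ctx_u, c, r, calls, returns):
    """ctx[cur] from ctx[u]: match a call pushing a symbol with a return popping it."""
    return {(p, q)
            for (p, c1, p1, sym) in calls if c1 == c
            for (q1, r1, sym1, q) in returns if r1 == r and sym1 == sym
            if (p1, q1) in ctx_u}


# Transitions of B as quoted in Example 5, with stack symbol "A".
calls = {("p", "c", "q", "A"), ("q", "c", "q", "A")}   # (state, letter, target, pushed)
returns = {("q", "r", "A", "q")}                        # (state, letter, popped, target)

ctx_eps = {("p", "p"), ("q", "q")}                      # ctx of the empty word (Example 3)
ctx_cr = wrap(ctx_eps, "c", "r", calls, returns)
ctx_crcr = compose(ctx_cr, ctx_cr)
print(sorted(ctx_cr))    # [('p', 'q'), ('q', 'q')]
print(sorted(ctx_crcr))  # [('p', 'q'), ('q', 'q')]
```

Only sets of at most $|\hat{Q}|^2$ pairs are manipulated, which is the reason the words themselves need not be stored.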
Using the context information cached along words we check convergence of
the fixpoint computations (lines 1-2) using the following qos directly on contexts,
lifted componentwise: $\sqsubseteq_{\subseteq}$ on $\wp(\wp(\hat{Q}^2))$ for prefixes and $\sqsubseteq_{\subseteq \times \subseteq}$ on
$\wp(\wp(\hat{Q}^2) \times \wp(\hat{Q}^2))$ for periods.
Incidentally, as we show below, we can perform the membership checks of
lines 5 and 7 (asking whether $uv^\omega \in L^\omega(B)$ given u and v) using the context
information associated to the prefix u and period v and nothing else.

Membership Check. To decide membership in $L^\omega(B)$ we use the membership
predicate $\mathrm{Inc}^B$ defined for $x, y_1, y_2 \in \wp(\hat{Q}^2)$ as follows:


$$\mathrm{Inc}^B(x, y_1, y_2) \triangleq \exists q, p \in \hat{Q},\ (\hat{q}_I, q) \in x \wedge (q,p) \in y_1^* \wedge (p,p) \in y_1^* \circ y_2 \circ y_1^*,$$

where, given two binary relations $y, y' \in \wp(\hat{Q}^2)$ on states of B, the notation $y \circ y'$
denotes their composition, and $y^*$ denotes the Kleene closure of y.

Proposition 8. For all $(u,v) \in L_d$, $\mathrm{Inc}^B(\mathrm{ctx}^B[u], \mathrm{ctx}^B[v], \mathrm{ctx}^B_{\circledast}[v]) \iff uv^\omega \in L^\omega(B)$.
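
The membership predicate only manipulates binary relations on states, so it is short to write down. The following Python sketch is one possible reading of it, taking the Kleene closure to be the reflexive-transitive closure; the names `compose`, `star` and `inc` are hypothetical. It is exercised on the contexts of ε and cr from Example 3, with p assumed to be the initial state of B.

```python
def compose(x, y):
    """Relational composition: first a step of x, then a step of y."""
    return {(a, c) for (a, b) in x for (b2, c) in y if b == b2}


def star(y, states):
    """Reflexive-transitive closure of y over the given set of states."""
    closure = {(s, s) for s in states} | set(y)
    while True:
        bigger = closure | compose(closure, closure)
        if bigger == closure:
            return closure
        closure = bigger


def inc(x, y1, y2, states, initial):
    """Membership predicate on contexts: x = ctx[u], y1 = ctx[v], y2 = final ctx of v."""
    y1_star = star(y1, states)
    loop = compose(compose(y1_star, y2), y1_star)
    return any((initial, q) in x and (q, p) in y1_star and (p, p) in loop
               for q in states for p in states)


states = {"p", "q"}
ctx_eps = {("p", "p"), ("q", "q")}   # context of the empty prefix
ctx_cr = {("p", "q"), ("q", "q")}    # context of the period cr
fin_cr = {("p", "q")}                # final context of the period cr

print(inc(ctx_eps, ctx_cr, fin_cr, states, "p"))  # False
```

The negative answer matches the counterexample of Example 4: no state can loop on powers of cr while visiting a final state. Replacing the final context by the hypothetical relation {("q", "q")} makes the predicate hold, since q is reachable from p and would then carry an accepting loop.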
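Proof. Let $(u,v) \in L_d$. Note that if $v \in C$ (resp. $v \in R$) then for every positive
integer n we have $v^n \in C$ (resp. $v^n \in R$) and $(p,q) \in \mathrm{ctx}^B[v]^* \implies \exists n, (p,q) \in \mathrm{ctx}^B[v^n]$.
Therefore, if $\mathrm{Inc}^B(\mathrm{ctx}^B[u], \mathrm{ctx}^B[v], \mathrm{ctx}^B_{\circledast}[v])$ holds then there are $q, p \in \hat{Q}$
and two positive integers n, m such that $(\hat{q}_I, q) \in \mathrm{ctx}^B[u]$, $(q,p) \in \mathrm{ctx}^B[v^n]$
and $(p,p) \in \mathrm{ctx}^B_{\circledast}[v^m]$. If $(u,v) \in C \times C$ then we deduce an accepting trace
of B on $uv^\omega$ of the form $(\hat{q}_I, \bot) \vdash^{*u} (q, \bot) \vdash^{*v^n} (p, \bot) \vdash^{\circledast v^m} (p, \bot) \cdots$. If
$(u,v) \in U_c \times R$ then we deduce an accepting trace of B on $uv^\omega$ of the form
$(\hat{q}_I, \bot) \vdash^{*u} (q, w) \vdash^{*v^n} (p, ww') \vdash^{\circledast v^m} (p, ww'w'') \cdots$ for some $w, w', w'' \in \hat{\Gamma}^*$.
Conversely if $uv^\omega \in L^\omega(B)$ then there is an accepting trace of B on $uv^\omega$.
- If $(u,v) \in C \times C$ then this trace is of the form


$$(\hat{q}_I, \bot) \vdash^{*u} (q, \bot) \vdash^{*v} (q_1, \bot) \vdash^{*v} (q_2, \bot) \vdash^{*v} \cdots$$


Since $\hat{Q}$ is finite, there is $p \in \hat{Q}$ and a sequence $\{n_k\}_{k \in \mathbb{N}}$ such that $q_{n_k} = p$
for all $k \in \mathbb{N}$. Since the trace is accepting there is $m \in \mathbb{N}$ such that
$(p, \bot) \vdash^{\circledast v^m} (p, \bot)$.
- If $(u,v) \in U_c \times R$ then it is of the form

$$(\hat{q}_I, \bot) \vdash^{*u} (q, w_0) \vdash^{*v} (q_1, w_0 w_1) \vdash^{*v} (q_2, w_0 w_1 w_2) \vdash^{*v} \cdots$$


where for each $j \in \mathbb{N}$ no symbol of $w_0 \cdots w_j$ is popped while reading v in the
sequence of transitions $(q_j, w_0 \cdots w_j) \vdash^{*v} (q_{j+1}, w_0 \cdots w_j w_{j+1})$. Thus, we can derive
sequences $(q_j, \bot) \vdash^{*v} (q_{j+1}, w_{j+1})$ for every $j \in \mathbb{N}$. There is $p \in \hat{Q}$ and a
sequence $\{n_k\}_{k \in \mathbb{N}}$ such that $q_{n_k} = p$ for all $k \in \mathbb{N}$ and since the trace is
accepting there is $m \in \mathbb{N}$ such that $(p, \bot) \vdash^{\circledast v^m} (p, w)$ for some $w \in \hat{\Gamma}^*$.


In both cases we deduce that $(\hat{q}_I, q) \in \mathrm{ctx}^B[u]$, $(q,p) \in \mathrm{ctx}^B[v^{n_0}]$ and
$(p,p) \in \mathrm{ctx}^B_{\circledast}[v^m]$. Thus, $\mathrm{Inc}^B(\mathrm{ctx}^B[u], \mathrm{ctx}^B[v], \mathrm{ctx}^B_{\circledast}[v])$ holds.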


By showing how to reason on contexts directly (for comparisons, for applying
functions $f_A$ and $r_A$, for convergence check and for membership check) we
removed the need to store words altogether since their contexts suffice. To sum
up, Algorithm 1 instantiated with the state-based qos can be implemented by
manipulating directly subsets of $\hat{Q}^2$ (for the prefixes) and pairs of subsets
of $\hat{Q}^2$ (for the periods) thereby removing the need to store and manipulate
words. We call this implementation of Algorithm 1 the state-based algorithm. We
conclude this section with its complexity.


Proposition 9. Letn ê |Q], ù ê |Q| and m ê maz{1, |5|}. The running time 
of the state-based algorithm is 20 )m2n4 
8 Experiments


Fig. 2. Scatter plot comparing the runtime (in seconds) of Ultimate and omegaVPLinc
on the Ultimate suite. Both axes feature a logarithmic scale. When a tool does not
return an answer within 1800 seconds (it runs out of time or memory) the data point
is plotted on the edge thereof (top edge for Ultimate, right edge for omegaVPLinc).


We implemented omegaVPLinc [11], a Java prototype of the state-based
algorithm, and evaluated it against Ultimate from Heizmann et al. [21] which
decides inclusion via complementation, intersection and emptiness check.⁷


Benchmarks. Our experiments use two sets of benchmarks. The first stems
from [18] and consists of 5 queries $L^\omega(A) \subseteq L^\omega(B)$ given A and B. We first
translated those VPA into the AutomataScript language that Ultimate and
omegaVPLinc can use and then we minimized them with Ultimate. The second set of
benchmarks consists of 281 instances of VPA $A, B_1, B_2, \ldots, B_n$ for which we run
the query $L^\omega(A) \subseteq \bigcup_{i=1}^{n} L^\omega(B_i)$. These VPA were computed by Ultimate from
randomly selected tasks in the termination category of SV-COMP (Software
Verification Competition). We used Ultimate to compute the union of $B_1, \ldots, B_n$ and
then minimize the result before running each query.


⁷ We excluded FADecider [18] from our evaluation because it returned 22 false positive
answers on a randomly chosen subset of 50 from our 286 benchmarks. Counterexamples
to inclusion for these benchmarks were validated with Ultimate. The problem
has been reported.

Experimental Setup. We ran our experiments on Debian GNU/Linux 11 (Bullseye)
64-bit, on a server with 20 GB of RAM and 2 Xeon E5640 2.6 GHz CPUs.
We used Ultimate version 0.2.1 with OpenJDK 11.0.13, whereas omegaVPLinc
uses OpenJDK 17.0.1. The maximal heap size for both programs was set to 6 GB and
they were given a timeout of 30 minutes (or, equivalently, 1800 seconds).


Results. Of the 5 benchmarks in the FADecider suite, omegaVPLinc is faster 
on 4 of them. Our prototype times out on the remaining one, while Ultimate 
runs out of memory. Of the 281 benchmarks in the Ultimate suite, omegaVPLinc
correctly returns an answer on 253 (165 ⊆ and 88 ⊈), times out on 27 and runs
out of memory on 1. Ultimate, however, only terminates on 142 benchmarks, 
running out of memory on the remaining 139 (the red data points on the top 
edge in Fig. 2). There are 7 benchmarks for which Ultimate terminates, but 
omegaVPLinc doesn’t (the data points on the right edge but not the top one), 
whereas there are 118 benchmarks for which omegaVPLinc terminates, but Ul- 
timate doesn’t (the red data points on the top edge but not the right one). Of 
the 135 benchmarks on which both tools terminate, omegaVPLinc is faster than 
Ultimate on 123 (data points touching no edges and above the diagonal). More- 
over omegaVPLinc and Ultimate coincide on whether inclusion holds (98) or 
not (37). This empirical evaluation suggests that omegaVPLinc scales up better 
than Ultimate on both of these benchmark sets. 


9 Conclusion and Future Work 


We presented novel algorithms to solve the inclusion problem between visibly 
pushdown languages of infinite words that leverage antichain-like techniques as 
well as the use of separate quasiorders for prefixes and periods of ultimately 
periodic words. Our empirical evaluation suggests that our approach scales up 
better than the ones relying on an explicit complementation. Future work includes
extending our approach to the class of operator-precedence languages [15], which also
enjoys an EXPTIME-complete inclusion problem, is strictly contained
in the class of deterministic CFL, and strictly contains VPL [8].
Abstract. A hyperproperty relates executions of a program and is used
to formalize security objectives such as confidentiality, non-interference,
privacy, and anonymity. Formally, a hyperproperty is a collection of
allowable sets of executions. A program violates a hyperproperty if the set
of its executions is not in the collection specified by the hyperproperty.
The logic HyperCTL* has been proposed in the literature to formally
specify and verify hyperproperties. The problem of checking whether
a finite-state program satisfies a HyperCTL* formula is known to be
decidable. However, the problem turns out to be undecidable for
procedural (recursive) programs. Surprisingly, we show that decidability can
be restored if we consider restricted classes of hyperproperties, namely
those that relate only those executions of a program which have the same
call-stack access pattern. We call such hyperproperties stack-aware
hyperproperties. Our decision procedure can be used as a proof method for
establishing security objectives such as noninference for recursive
programs, and also for refuting security objectives such as observational
determinism. Further, if the call stack size is observable to the attacker,
the decision procedure provides exact verification.


Keywords: Hyperproperties · Temporal Logic · Recursive Programs ·
Model Checking · Pushdown Systems · Visibly Pushdown Automata.


1 Introduction 


Temporal logics HyperLTL and HyperCTL* [5] were designed to express
and reason about security guarantees that are hyperproperties [6]. A
hyperproperty [6] is a security guarantee that does not depend solely on individual
executions. Instead, a hyperproperty relates multiple executions. For example,
non-interference, a confidentiality property, states that any two executions of a
program that differ only in high-level security inputs must have the same
low-security observations. As pointed out in [6], several security guarantees are
hyperproperties. The logic HyperCTL* subsumes HyperLTL, and the problem
of checking a finite-state system against a HyperCTL* formula is decidable [5].

In this paper, we consider the problem of model checking procedural
(recursive) programs against security hyperproperties. Recall that recursive programs are
naturally modeled as pushdown systems. Unlike the case of finite-state
transition systems, the problem of checking whether a pushdown system satisfies a
HyperCTL* formula is undecidable [16]. In contrast, CTL* model checking is
decidable for pushdown systems [3,18].


Our contributions. We consider restricted classes of hyperproperties for re- 
cursive programs, namely those that relate only those executions that have the 
same call-stack access pattern. Intuitively, two executions have the same stack 
access pattern if they access the call stack in the same manner at each step, i.e., 
if in one execution there is a push (pop) at a point, then there is a push (pop) 
at the same point in the other execution. Observe that if two executions have 
the same stack access pattern, then their stack sizes are the same at all times. 
We call such hyperproperties stack-aware hyperproperties.

In order to specify stack-aware hyperproperties, we extend HyperCTL* to
SHCTL*. The logic SHCTL* has a two-level syntax. At the first level, the
syntax is identical to HyperCTL* formulas, and is interpreted over executions
of the pushdown system with the same stack access pattern. At the top level,
we quantify over different stack access patterns. The formula Eψ is true if for
some stack access pattern ρ of the system, the pushdown system restricted to
executions with stack access pattern ρ satisfies the HyperCTL* formula ψ. The
formula Aψ is true if for each stack access pattern ρ of the system, the pushdown
system restricted to executions with stack access pattern ρ satisfies the
HyperCTL* formula ψ. See Figure 1 on Page 8 for a side-by-side comparison of the
syntax for HyperCTL* and SHCTL*. HyperLTL is extended to SHLTL
similarly. Please note that SHCTL* subsumes SHLTL, and that SHCTL* (SHLTL)
coincides with HyperCTL* (HyperLTL) for finite state systems as all
executions of finite state systems have the same stack access pattern.

We show that the model checking problem for SHCTL* is decidable. We
demonstrate three different ways this result can aid in verifying recursive
programs. First, for security guarantees such as noninference [14], which are
expressible in the ∀∃* fragment of HyperLTL, we can use the model checking
algorithm to establish that a recursive program satisfies the HyperLTL
property. Secondly, for the ∀* fragment of HyperLTL, the model checking algorithm
can be used to detect security flaws by establishing that a recursive program does
not satisfy security guarantees. Observational determinism [13,19] is an example
of such a property. Finally, when the attacker can observe stack access patterns
(or, equivalently, stack sizes), we can get exact verification for several
properties. The assumption of the attacker observing stack access patterns holds, for
example, in the program counter security model [15] in which the attacker has
access to program counters at each step. As argued in [15], the program counter
security model is appropriate to capture control-flow side channels such as those arising
from timing behavior and/or disclosure of errors.


The decision procedure uses an automata-theoretic approach inspired by
the model checking algorithm for finite state systems and HyperCTL* given
in [10]. Since stack-aware hyperproperties relate only executions with the same
stack access pattern, a set of executions with the same stack access pattern
can be encoded as a word over a pushdown alphabet, and the problem of
model checking a SHCTL* formula can be reduced to the problem of
checking emptiness of a non-deterministic visibly pushdown automaton (NVPA) over
infinite words [1]. The reduction of the model checking problem to the
emptiness problem is based on a compositional construction of an automaton for each
sub-formula which accepts exactly the set of assignments to path variables that
satisfy the sub-formula. For this construction to be optimal, we carefully leverage
two equi-expressive classes of automata on infinite words, namely NVPAs and
1-way alternating jump automata (1-AJA) [4]. The model checking algorithm
for SHCTL* against procedural programs has a complexity that is very close to
the complexity of model checking finite state systems against HyperCTL*. If
g(k,n) denotes a tower of exponentials of height k, where the topmost
exponent is poly(n), then for a formula with formula complexity r, and a system
and formula whose size is bounded by n, our algorithm is in DTIME(g(⌈r/2⌉, n)).
In comparison, model checking finite state systems against HyperCTL* is in
NSPACE(g(⌈r/2⌉ − 1, n)). This slight difference in complexity is consistent with
checking other properties like invariants for finite state systems (NL) versus
procedural programs (P).

We also prove that our model checking algorithm is optimal by proving a
matching lower bound. Our proof showing
$\mathrm{DTIME}(g(\lceil r/2 \rceil, n))$-hardness of the model checking problem
for formulas with (formula) complexity $r$ relies on reducing the membership
problem for $g(\lceil r/2 \rceil - 1, n)$ space bounded alternating
Turing machines (ATM) to the model checking problem. The reduction requires
identifying an encoding of computations of ATMs, which are trees, as strings
that can be guessed and generated by pushdown systems. The pushdown system
we construct for the model checking problem guesses potential computations
of the ATM, while the sHCTL* formula we construct checks if the guessed
computation is a valid accepting computation.


Related work. Clarkson and Schneider introduced hyperproperties [6] and
demonstrated their need to capture complex security properties. The temporal
logics HyperLTL and HyperCTL*, which describe hyperproperties, were introduced
by Clarkson et al. [5]. They also characterized the complexity of model checking
finite state transition systems against HyperCTL* specifications by a reduction
to the satisfiability problem of QPTL [17]. Subsequently, other model checking
algorithms for verifying finite state systems against HyperCTL* properties
have been proposed [10,7]. Tools that check satisfiability [8] and perform
runtime verification [9] for HyperLTL formulas have also been developed.
Finkbeiner et al. introduced the automata-theoretic approach to model checking
HyperCTL* for finite-state systems [10].


3 A pushdown alphabet is an alphabet that is partitioned into three sets: a set of call 
symbols, a set of internal symbols, and a set of return symbols. See Section 4.1. 

4 Our definition of formula complexity is roughly double the usual notion of quantifier 
alternation. For a precise definition, see Definition 4. 
The model checking problem for HyperLTL, and consequently HyperCTL*,
was shown to be undecidable for pushdown systems in [16]. For restricted
fragments of HyperLTL, Pommellet and Tayssir [16] introduced
over-approximations and under-approximations to establish/refute that a pushdown
system satisfies a HyperLTL formula in those fragments. Gutsfeld et al.
introduced stuttering $H_\mu$, a linear time logic for checking asynchronous
hyperproperties for recursive programs in [12]. The authors present complexity
results for the model checking problem under an assumption of fairness, and a
restriction of well-alignment. While the restriction to paths with the same
stack access pattern is similar to the well-alignment restriction, we do not
assume any fairness condition to establish decidability. However, as sHCTL* is a
branching time logic and only considers synchronous hyperproperties, the two
logics are not directly comparable. It is also worth mentioning that the
branching nature of sHCTL* requires us to “copy” a potentially unbounded stack,
from the most recently quantified path variable, when assigning a path to the
“current” quantified path variable. In contrast, all path assignments in [12]
start with an empty stack.

For lack of space, some proofs are omitted; they can be found in [2].


2 Motivation 


Clarkson and Schneider [6] argue that many important security guarantees are
expressible only as hyperproperties. We discuss two examples of security
hyperproperties, and the logics HyperLTL and HyperCTL* used to specify them.


Hyperproperties and temporal logics. We discuss two variants of
non-interference [11] that model confidentiality requirements. In
non-interference, the inputs of a system are partitioned into low-level input
security variables and high-level input security variables. The attacker is
assumed to know the values of low-level security inputs. During an execution,
the attacker can observe parts of the system configuration such as system
outputs, or the memory usage. A system satisfies non-interference if the
attacker cannot deduce the values of high-level inputs from the low-level
observations. To formally specify the variants, we use the logic HyperLTL [5],
a fragment of the logic HyperCTL* [5]. The precise syntax of HyperLTL and
HyperCTL* is shown in Fig. 1. In the syntax, $\pi$ is a path variable and the
formula $a_\pi$ is true if the proposition $a$ is true along the path $\pi$.
Intuitively, the formula $\exists\pi.\,\psi$ is existential quantification over
paths, and is true if there is a path that can be assigned to $\pi$ such that
$\psi$ is true. Universal quantification ($\forall\pi.\,\psi$), other logical
connectives such as conjunction ($\wedge$), implication ($\rightarrow$), and
equivalence ($\leftrightarrow$), and the temporal operators $\mathsf{G}$ and
$\mathsf{F}$ can be defined in the standard way. By having explicit path
variables, HyperLTL and HyperCTL* allow quantification over multiple paths
simultaneously.


Example 1. The first variant, noninference [14], states that for each execution
$\sigma$ of a program, there is another execution $\sigma'$ such that (a)
$\sigma'$ is obtained from $\sigma$ by replacing the high-level security inputs
by a dummy input, and (b) $\sigma$ and $\sigma'$ have the same low-level
observations. Noninference is a hyperliveness property [5,6].
Let us assume that the low-level observations of a configuration are determined
by the values of the propositions in $L = \{\ell_1, \ldots, \ell_m\}$. As shown
in [5], noninference is expressible by the HyperLTL formula

\[ \mathit{NI} \stackrel{\mathrm{def}}{=} \forall\pi.\,\exists\pi'.\,(\mathsf{G}\,\lambda_{\pi'}) \wedge \pi =_L \pi'. \]

Here $\mathsf{G}\,\lambda_{\pi'}$ expresses that globally (that is, in each
configuration of the execution) the high input of $\pi'$ is the dummy input
$\lambda$, and
$\pi =_L \pi' \equiv \mathsf{G}\bigl(\bigwedge_{\ell \in L} (\ell_\pi \leftrightarrow \ell_{\pi'})\bigr)$
expresses that $\pi$ and $\pi'$ have the same low-level observations.


Example 2. The second variant, observational determinism [13,19], states that
any two executions that have the same low-level initial inputs must have the
same low-level output observations. Observational determinism is a hypersafety
property [5,6], and is also expressible in HyperLTL using the formula [5]:

\[ \mathit{OD} \stackrel{\mathrm{def}}{=} \forall\pi.\,\forall\pi'.\,(\pi[0] =_{L,in} \pi'[0]) \rightarrow \pi =_{L,out} \pi'. \]

Here $=_{L,in}$ and $=_{L,out}$ express the fact that $\pi$ and $\pi'$ have the
same low-security inputs and outputs respectively.


Procedural (recursive) programs and stack-aware hyperproperties.
Pushdown systems model procedural programs that do not dynamically allocate
memory, and whose program variables take values in finite domains. Unlike
for finite-state transition systems, the problem of checking whether a pushdown
system satisfies a HyperCTL* formula is undecidable [16]. However, we identify a
natural class of hyperproperties for which the model checking problem becomes
decidable. As we shall shortly see, this class of hyperproperties not only
enjoys decidability, but is also useful in reasoning about security
hyperproperties such as noninference and observational determinism.

We consider a restricted class of hyperproperties for recursive programs,
which relate only executions that access the call stack in the same manner,
i.e., push or pop at the same time. An execution of a pushdown system
$\mathcal{P}$ is a sequence of configurations (control state + stack)
$\sigma = c_1 c_2 \cdots$, such that the stacks of consecutive configurations
$c_i$ and $c_{i+1}$ differ only due to the possible presence of an additional
element at the top of the stack of either $c_i$ or $c_{i+1}$. With such a
sequence, we can associate a sequence $pr(\sigma) = o_1 o_2 \cdots$ with
$o_i \in \{call, int, ret\}$ such that $o_i = call$ ($ret$ respectively) if and
only if the stack in $c_{i+1}$ has one more (one fewer, respectively) element
than the stack in $c_i$. The sequence $pr(\sigma)$ is said to be the stack
access pattern of $\sigma$. Observe that the stack sizes of two executions with
the same stack access pattern evolve in the same fashion. Thus, equivalently,
we can consider this class of hyperproperties to be the hyperproperties that
relate executions with identical memory usage.
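
The following Python sketch illustrates how a stack access pattern can be read off a finite prefix of an execution. It is only an illustration under assumptions that are not part of the paper: configurations are modeled as hypothetical `(state, stack)` pairs, stacks are tuples with the top of the stack first, and only finite prefixes are handled, whereas executions in the paper are infinite.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: (control_state, stack), top of stack first.
Config = Tuple[str, Tuple[str, ...]]


def stack_access_pattern(prefix: Sequence[Config]) -> List[str]:
    """Return the call/int/ret tags for a finite prefix of an execution."""
    pattern = []
    for (_, before), (_, after) in zip(prefix, prefix[1:]):
        if len(after) == len(before) + 1 and after[1:] == before:
            pattern.append("call")  # one symbol was pushed on top
        elif len(after) == len(before) - 1 and before[1:] == after:
            pattern.append("ret")   # one symbol was popped from the top
        elif after == before:
            pattern.append("int")   # stack unchanged
        else:
            raise ValueError("stacks differ by more than one push or pop")
    return pattern


# Two prefixes with different states and stack symbols.
run_a = [("p", ()), ("q", ("x",)), ("q", ("x",)), ("r", ())]
run_b = [("u", ()), ("v", ("y",)), ("w", ("y",)), ("u", ())]
```

Both `run_a` and `run_b` push one symbol, take an internal step, and then pop, so they share a stack access pattern even though their control states and stack contents differ. A stack-aware hyperproperty may relate such executions, but not executions whose patterns diverge.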

To specify these hyperproperties, we propose the logic sHCTL*, which extends
HyperCTL*. sHCTL* has a two-level syntax. At the innermost level,
the syntax is identical to that of HyperCTL* formulas, but is interpreted over
executions of the pushdown system with the same stack access pattern. At the
outer level, we quantify over different stack access patterns. Intuitively, the
formula $\mathsf{E}\psi$ is true if there is a stack access pattern $\rho$
exhibited by the system such that the set of executions with access pattern
$\rho$ satisfies the hyperproperty $\psi$. The dual formula $\mathsf{A}\psi$,
defined as $\neg\mathsf{E}\neg\psi$, is true if for each stack access pattern
$\rho$ exhibited by the system, the set of all executions with stack access
pattern $\rho$ satisfies $\psi$. The syntax of sHLTL is obtained from HyperLTL
in a similar fashion. See Fig. 1 for a side-by-side comparison of the syntax of
HyperCTL* (HyperLTL) and sHCTL* (sHLTL). Unlike for HyperCTL*, we
show that the problem of checking sHCTL* is decidable for pushdown systems
(Theorem 3). Formal definitions of stack access patterns and of the syntax and
semantics of sHCTL* are in Section 3.

For the rest of the paper, hyperproperties expressible in sHCTL* will be
called stack-aware hyperproperties. Restricting to stack-aware hyperproperties is
useful in verifying security guarantees of recursive programs as discussed below.


Proving $\forall\exists^*$ hyperproperties. The noninference property
(Example 1) can be expressed in HyperLTL as
$\mathit{NI} \stackrel{\mathrm{def}}{=} \forall\pi.\,\exists\pi'.\,(\mathsf{G}\,\lambda_{\pi'}) \wedge \pi =_L \pi'$.
Consider the sHLTL formula $\mathsf{A}(\mathit{NI})$ obtained by putting an
$\mathsf{A}$ in front of $\mathit{NI}$. A pushdown system satisfies
$\mathsf{A}(\mathit{NI})$ only if for each execution $\sigma$ of the system,
there is another execution $\sigma'$ with the same stack access pattern as
$\sigma$ such that $\sigma, \sigma'$ together satisfy
$(\mathsf{G}\,\lambda_{\sigma'}) \wedge \sigma =_L \sigma'$. Thus, if the
pushdown system satisfies the sHLTL formula $\mathsf{A}(\mathit{NI})$, then it
also satisfies noninference, and a decision procedure for sHLTL can be used to
prove that a recursive program satisfies noninference.
The above observation generalizes to HyperLTL formulas of the form
$\psi = \forall\pi.\,\exists\pi_1.\cdots\exists\pi_k.\,\psi'$: if a system
satisfies the sHLTL formula $\mathsf{A}\psi$ then it must also satisfy the
HyperLTL formula $\psi$. Though the model checking problem is undecidable for
pushdown systems even when restricted to such HyperLTL formulas, we gain
decidability by restricting the search space for $\pi, \pi_1, \ldots, \pi_k$.


Refuting $\forall^*$ hyperproperties. Observational determinism (Example 2) can
be written in HyperLTL as
$\mathit{OD} \stackrel{\mathrm{def}}{=} \forall\pi.\,\forall\pi'.\,(\pi[0] =_{L,in} \pi'[0]) \rightarrow \pi =_{L,out} \pi'$.
Consider the sHLTL formula $\mathsf{A}(\mathit{OD})$. A pushdown system fails to
satisfy the sHLTL formula $\mathsf{A}(\mathit{OD})$ only if there is a stack
access pattern $\rho$ and executions $\sigma_1$ and $\sigma_2$ with stack access
pattern $\rho$ that do not satisfy
$(\sigma_1[0] =_{L,in} \sigma_2[0]) \rightarrow \sigma_1 =_{L,out} \sigma_2$.

This observation generalizes to HyperLTL formulas of the form
$\psi = \forall\pi_1.\cdots\forall\pi_k.\,\psi'$: if a pushdown system fails to
satisfy the sHLTL formula $\mathsf{A}\psi$ then it does not satisfy $\psi$. Even
though model checking pushdown systems against such restricted specifications is
undecidable, our decision procedure can be used to show that a recursive program
does not meet such properties.


Exact verification when the stack access pattern is observable. Often, it is
reasonable to assume that the attacker can observe the stack access pattern. For
example, in the program counter security model [15], the attacker has access to
the program counter transcript, i.e., the sequence of program counters during an
execution. Access to the program counter transcript implies that the attacker
can observe the stack access pattern. The assumption that the program counter
transcript is observable helps model control flow side channel attacks, which
include timing attacks and error disclosure attacks [15]. sHCTL* can be used to
verify security guarantees in this security model. For example, the sHCTL*
formula $\mathsf{A}(\mathit{NI})$ models noninference faithfully by introducing
a unique proposition for each control state. Observational determinism can also
be verified in this model by suitably transforming the pushdown automaton.

Another scenario in which stack access patterns are observable is when the
attacker can observe the memory usage of a program in terms of stack size.
As observing stack size may lead to information leakage, stack size should be
considered a low-level observation. Since the stack size can be unbounded, it
cannot be modeled as a proposition. sHCTL*, however, can still be used to verify
security guarantees in this case. For example,
$\mathsf{A}(\mathit{NI}) = \mathsf{A}(\forall\pi.\,\exists\pi'.\,(\mathsf{G}\,\lambda_{\pi'}) \wedge \pi =_L \pi')$
faithfully models noninference, as the semantics of sHCTL* forces $\pi$ and
$\pi'$ to have the same call-stack size in addition to other low-level
observations. Once again, observational determinism can also be verified in this
model by suitably transforming the pushdown automaton.


3 Stack-aware Hyper Computation Tree Logic (sHCTL*)


This section formally presents Stack-aware Hyper Computation Tree Logic
(sHCTL*) and its sub-logic Stack-aware Hyper Linear Temporal Logic (sHLTL). We
begin by establishing some conventions over strings.


Strings. A string/word $w$ over a finite alphabet $\Sigma$ is a sequence
$w = a_0 a_1 \cdots$ of finitely or infinitely many symbols from $\Sigma$, i.e.,
$a_i \in \Sigma$ for all $i$. The length of a string $w$, denoted $|w|$, is the
number of symbols appearing in it: if $w = a_0 a_1 \cdots a_{n-1}$ is finite
then $|w| = n$, and if $w = a_0 a_1 \cdots$ is infinite then $|w| = \omega$.
The unique string of length 0, the empty string, is denoted $\varepsilon$. For a
string $w = a_0 a_1 \cdots$, $w(i) = a_i$ denotes the $i$th symbol,
$w[:i] = a_0 a_1 \cdots a_{i-1}$ denotes the prefix of length $i$,
$w[i:] = a_i a_{i+1} \cdots$ denotes the suffix of $w$ starting at position $i$,
and $w[i:j] = a_i a_{i+1} \cdots a_{j-1}$ denotes the substring from position
$i$ (included) to position $j$ (not included). Thus $w[0:] = w$. By convention,
when $i < 0$, we take $w[:i] = \varepsilon$. Over $\Sigma$, the set of all
finite strings is denoted $\Sigma^*$, and the set of all infinite strings is
denoted $\Sigma^\omega$. For a finite string $u$ and a (finite or infinite)
string $v$, $uv$ denotes the concatenation of $u$ and $v$.


3.1 Pushdown Systems 


Pushdown systems naturally model sequential recursive programs. Formally,
an AP-labeled pushdown system is a tuple
$\mathcal{P} = (S, \Gamma, s_{in}, \Delta, L)$, where $S$ is a finite set of
control states, $\Gamma$ is a finite set of stack symbols, $s_{in} \in S$ is the
initial control state, $L : S \rightarrow 2^{AP}$ is the labeling function, and
$\Delta$ is the transition relation. The transition relation
$\Delta = \Delta_{int} \cup \Delta_{call} \cup \Delta_{ret}$ is the disjoint
union of internal transitions $\Delta_{int} \subseteq S \times S$, where the
stack is unchanged, call transitions
$\Delta_{call} \subseteq S \times (S \times \Gamma)$, where a single symbol is
pushed onto the stack, and return transitions
$\Delta_{ret} \subseteq (S \times \Gamma) \times S$, where a single symbol is
popped from the stack. When AP is clear from the context, we simply refer to
them as pushdown systems.

Transition System Semantics. We recall the standard semantics of a pushdown
system as a transition system. Let us fix a pushdown system
$\mathcal{P} = (S, \Gamma, s_{in}, \Delta, L)$. A configuration $c$ of
$\mathcal{P}$ is a pair $(s, \alpha)$ where $s \in S$ and $\alpha \in \Gamma^*$.
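$a \in AP,\ \pi \in V$

(a) HyperCTL*:
    $\psi ::= a_\pi \mid \neg\psi \mid \psi \vee \psi \mid \mathsf{X}\psi \mid \psi\,\mathsf{U}\,\psi \mid \exists\pi.\,\psi$

(b) sHCTL*:
    $\theta ::= \mathsf{E}\psi \mid \neg\theta \mid \theta \vee \theta$
    $\psi ::= a_\pi \mid \neg\psi \mid \psi \vee \psi \mid \mathsf{X}\psi \mid \psi\,\mathsf{U}\,\psi \mid \exists\pi.\,\psi$

Fig. 1: BNF for HyperCTL* and sHCTL*. Let $\forall\pi.\,\psi$ denote
$\neg\exists\pi.\,\neg\psi$ and $\mathsf{A}\psi$ denote $\neg\mathsf{E}\neg\psi$.
HyperLTL is the set of HyperCTL* formulas $Q_1\pi_1.\cdots Q_r\pi_r.\,\psi$
where $Q_i \in \{\exists, \forall\}$ and $\psi$ is quantifier-free. sHLTL is the
set of sHCTL* formulas $\mathbb{E}\psi$, where $\mathbb{E} \in \{\mathsf{A}, \mathsf{E}\}$
and $\psi$ is in HyperLTL.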


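The set of all configurations of $\mathcal{P}$ will be denoted
$\mathrm{Conf}_{\mathcal{P}} = S \times \Gamma^*$. The labeled transition system
associated with $\mathcal{P}$ is
$[\![\mathcal{P}]\!] := (\mathrm{Conf}_{\mathcal{P}}, c_{in}, \rightarrow, AP, L)$
where $c_{in} = (s_{in}, \varepsilon)$ is the initial configuration,
$\rightarrow\ \subseteq \mathrm{Conf}_{\mathcal{P}} \times (\{call, ret, int\} \times S \times (\Gamma \cup \{\varepsilon\}) \times S) \times \mathrm{Conf}_{\mathcal{P}}$
is the transition relation, and $L$ is the labeling function that
extends the labeling function of $\mathcal{P}$ to configurations as follows:
$L(s, \alpha) = L(s)$. The transition relation $\rightarrow$ is defined to
capture the informal semantics of internal, call, and return transitions. For
any $\alpha \in \Gamma^*$:

  (int)  $(s, \alpha) \xrightarrow{(int, s, \varepsilon, s')} (s', \alpha)$ iff $(s, s') \in \Delta_{int}$;
  (call) $(s, \alpha) \xrightarrow{(call, s, a, s')} (s', a\alpha)$ iff $(s, (s', a)) \in \Delta_{call}$; and
  (ret)  $(s, a\alpha) \xrightarrow{(ret, s, a, s')} (s', \alpha)$ iff $((s, a), s') \in \Delta_{ret}$.

A path of $[\![\mathcal{P}]\!]$ is an infinite sequence of configurations
$\sigma = c_0, c_1, \ldots$ such that for each $i$,
$c_i \xrightarrow{(o, s, a, s')} c_{i+1}$ for some $o \in \{int, call, ret\}$,
$s, s' \in S$ and $a \in \Gamma \cup \{\varepsilon\}$.
The path $\sigma$ is said to start in configuration $c_0$ (the first
configuration in the sequence). We will use $Paths([\![\mathcal{P}]\!], c)$ to
denote the set of paths of $[\![\mathcal{P}]\!]$ starting in the configuration
$c$ and $Paths([\![\mathcal{P}]\!])$ to denote all paths of $[\![\mathcal{P}]\!]$.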

We conclude this section by introducing some notation on configurations. For
$c = (s, \alpha)$, its stack height is $|\alpha|$, its control state is
$state(c) = s$, and its top of stack symbol is $top(c) = a \in \Gamma$ if
$\alpha = a\alpha'$ and is undefined if $\alpha = \varepsilon$.
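
The transition system semantics can be made concrete with a small Python sketch. The class name `PushdownSystem`, the method `successors`, and the toy system below are hypothetical and chosen only for illustration; the empty string stands in for $\varepsilon$ in transition labels, stacks are tuples with the top of the stack first, and the labeling function $L$ is omitted.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

Config = Tuple[str, Tuple[str, ...]]  # (state, stack), top of stack first
Label = Tuple[str, str, str, str]     # (tag, s, a, s'); a == "" stands for epsilon


@dataclass(frozen=True)
class PushdownSystem:
    internal: FrozenSet[Tuple[str, str]]          # pairs (s, s')
    call: FrozenSet[Tuple[str, Tuple[str, str]]]  # pairs (s, (s', a))
    ret: FrozenSet[Tuple[Tuple[str, str], str]]   # pairs ((s, a), s')

    def successors(self, config: Config) -> List[Tuple[Label, Config]]:
        """All labeled one-step successors of a configuration."""
        state, stack = config
        result = []
        for s, t in sorted(self.internal):
            if s == state:  # stack unchanged
                result.append((("int", s, "", t), (t, stack)))
        for s, (t, a) in sorted(self.call):
            if s == state:  # push a on top
                result.append((("call", s, a, t), (t, (a,) + stack)))
        for (s, a), t in sorted(self.ret):
            if s == state and stack and stack[0] == a:  # pop a from the top
                result.append((("ret", s, a, t), (t, stack[1:])))
        return result


# Toy system: main calls f (pushing "r"), f may loop, f returns to main.
toy = PushdownSystem(
    internal=frozenset({("f", "f")}),
    call=frozenset({("main", ("f", "r"))}),
    ret=frozenset({(("f", "r"), "main")}),
)
```

Calling `toy.successors(("main", ()))` yields the single call step into `f`, while `toy.successors(("f", ("r",)))` yields both the internal loop and the return to `main`. Return transitions are enabled only when the popped symbol is on top of the stack, which is why the stack contents matter for paths but not for their stack access patterns.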


3.2 Syntax of sHCTL*


Let us fix a set of atomic propositions AP, and a set of path variables $V$. The
BNF grammar for sHCTL* formulas is given in Figure 1(b). In the BNF grammar,
$a \in AP$ is an atomic proposition, $\pi$ is a path variable, $\psi$ is a
cognate formula, and $\theta$ is a sHCTL* formula. The syntax has two levels,
with the inner level identical to HyperCTL* formulas, while the outer level
allows quantification over different stack access patterns (see Section 3.3).
Also, following [5,10], we assume that the until operator $\mathsf{U}$ only
occurs within the scope of a path quantifier.


Remark 1. We have chosen to not have $\mathsf{A}$, the dual of $\mathsf{E}$, and
conjunction as explicit logical operators to keep our exposition simple. This
choice does make the automata constructions presented here less efficient for
formulas involving conjunction. Adding them explicitly does not pose a technical
challenge to our setup and our automata constructions can be extended to handle
them explicitly. In addition, we will sometimes use other quantifiers and
logical operators to write formulas. Some standard examples include:
$\delta_1 \wedge \delta_2 = \neg(\neg\delta_1 \vee \neg\delta_2)$, where
$\delta_i$ ($i \in \{1, 2\}$) is either a sHCTL* or cognate formula;
$\forall\pi.\,\psi = \neg\exists\pi.\,\neg\psi$;
$\mathsf{F}\psi = true\ \mathsf{U}\ \psi$, where $true = a_\pi \vee \neg a_\pi$;
and $\mathsf{G}\psi = \neg\mathsf{F}\neg\psi$.


We call formulas of the form $\mathbb{E}\psi$ (where
$\mathbb{E} \in \{\mathsf{A}, \mathsf{E}\}$ and $\psi$ is a cognate formula)
basic formulas. Observe that any sHCTL* formula is a Boolean combination of
basic formulas. A sHCTL* formula $\theta$ is a sentence if in each basic
sub-formula $\mathbb{E}\psi$, $\psi$ is a sentence, i.e., every path variable
appearing in $\psi$ is quantified. Without loss of generality, we assume that in
any cognate formula $\psi$, all bound variables in $\psi$ are renamed to ensure
that any path variable is quantified at most once. We will only consider sHCTL*
sentences in this paper. The logic sHLTL is the sub-logic of sHCTL* consisting
of all formulas of the form $\mathbb{E}\,Q_1\pi_1.\cdots Q_r\pi_r.\,\psi$ where
$\mathbb{E} \in \{\mathsf{A}, \mathsf{E}\}$, $Q_i \in \{\exists, \forall\}$ and
$\psi$ is quantifier free.


3.3 Semantics of sHCTL*


The syntax of cognate formulas is identical to that of HyperCTL* formulas. Their
semantics will be described in a similar manner, in a context where free path
variables in the formula are interpreted as executions of a system. However, we
will require that the interpretations of all path variables share a common stack
access pattern, hence the term cognate. Thus, before defining the semantics,
we will define what we mean by the stack access pattern of a path and by a path
environment that assigns an interpretation to path variables.

For the rest of this section let us fix a pushdown system
$\mathcal{P} = (S, \Gamma, s_{in}, \Delta, L)$.
A string $w \in \{call, int, ret\}^*$ is said to be well matched if either
$w = \varepsilon$ or $w = int$ or $w = call\ u\ ret$ or $w = uv$, where
$u, v \in \{call, int, ret\}^*$ are (recursively) well matched. In a string
$\rho \in \{call, int, ret\}^\omega$, $\rho(i)$ is an unmatched return if
$\rho[:i+1] = w\ ret$, where $w$ is well matched. We are now ready to present
the definition of a stack access pattern.


Definition 1 (Stack access pattern). A string
$\rho \in \{call, int, ret\}^\omega$ is a stack access pattern if the set
$\{i \in \mathbb{N} \mid \rho(i) \text{ is an unmatched return}\}$ is finite.

A path $\sigma = c_0 c_1 c_2 \cdots \in Paths([\![\mathcal{P}]\!])$ is said to
have a stack access pattern $\rho = o_0 o_1 \cdots$ (denoted
$pr(\sigma) = \rho$) if for every $i$: (a) $o_i = call$ if and only if
$stack(c_{i+1}) = top(c_{i+1})\,stack(c_i)$, (b) $o_i = int$ if and only if
$stack(c_{i+1}) = stack(c_i)$, and (c) $o_i = ret$ if and only if
$stack(c_i) = top(c_i)\,stack(c_{i+1})$.
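
The recursive notion of a well-matched string used in these definitions can be checked on finite strings with a single counter. The sketch below is an illustration only; the function name `is_well_matched` is hypothetical, and the counter-based test is the standard characterization of balanced strings (no prefix has more returns than calls, and the totals agree), used here in place of the recursive definition.

```python
from typing import Sequence


def is_well_matched(w: Sequence[str]) -> bool:
    """Check whether a finite string over {call, int, ret} is well matched."""
    pending_calls = 0  # calls seen so far that have no matching return yet
    for tag in w:
        if tag == "call":
            pending_calls += 1
        elif tag == "ret":
            if pending_calls == 0:
                return False  # a return with no pending call
            pending_calls -= 1
        elif tag != "int":
            raise ValueError(f"unknown tag: {tag!r}")
    return pending_calls == 0  # every call must have been returned from
```

For instance, `["call", "int", "ret", "int"]` is well matched, while `["call", "call", "ret"]` is not, since one call is still pending at the end.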


We now present the definition of a path environment that interprets the free
path variables in a cognate formula as paths of $[\![\mathcal{P}]\!]$ such that
they share a common stack access pattern. This plays a key role in defining the
semantics of sHCTL*. For a set of path variables $V$, let $V^\dagger$ be defined
as the set $V \cup \{\dagger\}$.


Definition 2 (Path Environment). A path environment for pushdown system
$\mathcal{P}$ over variables $V$ is a function
$\Pi : V^\dagger \rightarrow Paths([\![\mathcal{P}]\!]) \cup \{call, int, ret\}^\omega$
such that $\Pi(\dagger)$ is a stack access pattern, and for every $\pi \in V$,
$\Pi(\pi) \in Paths([\![\mathcal{P}]\!])$ with $pr(\Pi(\pi)) = \Pi(\dagger)$.
When the pushdown system is clear from the context, we will simply refer to it
as a path environment over $V$.

When $V = \emptyset$, we additionally require that there is a path
$\sigma \in Paths([\![\mathcal{P}]\!], c_{in})$ (where $c_{in}$ is the initial
configuration of $[\![\mathcal{P}]\!]$) such that $pr(\sigma) = \Pi(\dagger)$.


We introduce some notation related to path environments. Let us fix a path
environment $\Pi$ over variables $V$. Given a path
$\sigma \in Paths([\![\mathcal{P}]\!])$, $\Pi[\pi \mapsto \sigma]$ denotes
the path environment over $V \cup \{\pi\}$ such that
$\Pi[\pi \mapsto \sigma](\pi) = \sigma$, and
$\Pi[\pi \mapsto \sigma](\pi') = \Pi(\pi')$ for any $\pi' \in V^\dagger$ with
$\pi' \neq \pi$. Finally, for $i \in \mathbb{N}$, $\Pi[i:]$ denotes the
suffix path environment, where every variable is mapped to the suffix of the
path starting at position $i$. More formally, for every $\pi' \in V^\dagger$,
$\Pi[i:](\pi') = \Pi(\pi')[i:]$.

We now define when a pushdown system $\mathcal{P}$ satisfies a sHCTL* sentence
$\theta$, denoted $\mathcal{P} \models \theta$. The definition of satisfaction
of $\theta$ relies on a definition of satisfaction for cognate formulas. To
inductively define the semantics of cognate formulas, we will interpret free
path variables using a path environment. Finally, as in HyperCTL*, it is
important to track the most recently quantified path variable because that
influences the semantics of $\exists\pi.(\cdot)$. Thus satisfaction of
cognate formulas takes the form $\mathcal{P}, \Pi, \pi' \models \psi$, where
$\pi'$ is the most recently quantified path variable, and $\Pi$ is a path
environment over the free variables of $\psi$. Finally, by convention, we will
take $Paths([\![\mathcal{P}]\!], \Pi(\dagger)(0))$ to mean
$Paths([\![\mathcal{P}]\!], c_{in})$, where $c_{in}$ is the initial
configuration of $[\![\mathcal{P}]\!]$ (see footnote 5). Below, $\theta$,
$\theta_1$, and $\theta_2$ are sHCTL* sentences, while $\psi$, $\psi_1$,
$\psi_2$ are cognate formulas.


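\[
\begin{array}{lcl}
\mathcal{P} \models \neg\theta & \text{iff} & \mathcal{P} \not\models \theta \\
\mathcal{P} \models \theta_1 \vee \theta_2 & \text{iff} & \mathcal{P} \models \theta_1 \text{ or } \mathcal{P} \models \theta_2 \\
\mathcal{P} \models \mathsf{E}\psi & \text{iff} & \text{for some path environment } \Pi \text{ over } \emptyset,\ \mathcal{P}, \Pi, \dagger \models \psi \\
\mathcal{P}, \Pi, \pi' \models a_\pi & \text{iff} & a \in L(\Pi(\pi)(0)) \\
\mathcal{P}, \Pi, \pi' \models \neg\psi & \text{iff} & \mathcal{P}, \Pi, \pi' \not\models \psi \\
\mathcal{P}, \Pi, \pi' \models \psi_1 \vee \psi_2 & \text{iff} & \mathcal{P}, \Pi, \pi' \models \psi_1 \text{ or } \mathcal{P}, \Pi, \pi' \models \psi_2 \\
\mathcal{P}, \Pi, \pi' \models \mathsf{X}\psi & \text{iff} & \mathcal{P}, \Pi[1:], \pi' \models \psi \\
\mathcal{P}, \Pi, \pi' \models \psi_1\,\mathsf{U}\,\psi_2 & \text{iff} & \exists i \geq 0 : \mathcal{P}, \Pi[i:], \pi' \models \psi_2 \text{ and } \forall j,\ 0 \leq j < i,\ \mathcal{P}, \Pi[j:], \pi' \models \psi_1 \\
\mathcal{P}, \Pi, \pi' \models \exists\pi.\,\psi & \text{iff} & \exists \sigma \in Paths([\![\mathcal{P}]\!], \Pi(\pi')(0)) \text{ with } pr(\sigma) = \Pi(\dagger) \\
 & & \text{such that } \mathcal{P}, \Pi[\pi \mapsto \sigma], \pi \models \psi
\end{array}
\]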


4 A Decision Procedure for sHCTL*


Given a pushdown system $\mathcal{P}$ and a sHCTL* sentence $\theta$, we present
an algorithm that determines if $\mathcal{P} \models \theta$. Our approach is
similar to the one in [10]. Given a finite state transition system $K$ and a
HyperCTL* formula $\varphi$, Finkbeiner et al. [10] construct an alternating
(finite state) Büchi automaton $A_{K,\varphi}$, by induction on $\varphi$, such
that an input word $\sigma$ is accepted by $A_{K,\varphi}$ if and only if
$\sigma$ is the encoding of a path environment $\Pi$ such that
$K, \Pi \models \varphi$. Determining if $K \models \varphi$ then reduces
to checking if $A_{K,\varphi}$ accepts any string.


5 The convention is needed because $\Pi(\dagger)(0)$ is not a configuration but
an element of the set {call, int, ret}.

Extending these ideas to sHCTL* and pushdown systems requires one to
answer two questions: (a) What is an encoding of path environments for cognate
formulas where path variables are mapped to sequences of configurations
(control state + stack)? (b) Which automata models can capture the collection
of path environments satisfying a cognate formula with respect to a pushdown
system? We encode path environments for cognate formulas using strings over
a pushdown alphabet: pushdown tags on symbols add structure that helps
encode sequences of configurations. For automata, we consider automata
that process such strings and accept visibly pushdown languages. A natural
generalization of the approach outlined in [10] would suggest the use of
alternating visibly pushdown automata (AVPA) on infinite strings [4]. However,
using AVPAs results in an inefficient algorithm. To get a more efficient
algorithm, we instead rely on a careful use of nondeterministic visibly pushdown
automata (NVPA) [1] and 1-way alternating jump automata (1-AJA) [4]. The
advantage of using NVPA and 1-AJA can be seen in the case of existential
quantification ($\exists\pi.$), which requires converting an alternating
automaton to a nondeterministic one [10]: converting from 1-AJA to NVPA leads to
an exponential blowup, while converting AVPA to NVPA leads to a doubly
exponential blowup [4].

The rest of this section is organized as follows. We begin by introducing 
the automata models on pushdown alphabets (Section 4.1). Next we present 
our encoding of path environments, and finally our automata constructions that 
establish the decidability result (Section 4.2). 


4.1 Automata on Pushdown Alphabets 


A pushdown alphabet is a finite set $\Sigma$ that is partitioned into three sets
$\Sigma_{call} \cup \Sigma_{int} \cup \Sigma_{ret}$, where $\Sigma_{call}$ is
the set of call symbols, $\Sigma_{int}$ is the set of internal symbols, and
$\Sigma_{ret}$ is the set of return symbols. Automata models processing
strings over a pushdown alphabet are restricted to perform certain types of
transitions based on whether the read symbol is a call, internal, or return
symbol. We introduce, informally, two such automata models next. Precise
definitions and semantics can be found in the detailed version of this paper [2].


Nondeterministic Visibly Pushdown Büchi Automata. A nondeterministic visibly
pushdown automaton (NVPA) [1] is like a pushdown system. It has
finitely many control states and uses an unbounded stack for storage. However,
unlike a pushdown system, it is an automaton that processes an infinite sequence
of input symbols from a pushdown alphabet
$\Sigma = \Sigma_{call} \cup \Sigma_{int} \cup \Sigma_{ret}$. Transitions
are constrained to conform to the pushdown alphabet: whenever a $\Sigma_{call}$
symbol is read, a symbol is pushed onto the stack; whenever a $\Sigma_{ret}$
symbol is read, the top stack symbol is popped; and whenever a $\Sigma_{int}$
symbol is read, the stack is unchanged.


1-way Alternating Jump Automata. Our second automaton model is 1-way
Alternating Parity Jump Automata (1-AJA) [4]. 1-AJA are computationally
equivalent to NVPAs (i.e., they accept the same class of languages) but provide
greater flexibility in describing algorithms. 1-AJAs are alternating automata,
which means that they can define acceptance based on multiple runs of the
machine on an input word. Though they are finite state machines with no
auxiliary storage, their ability to spawn a computation thread that jumps to a
future portion of the input string on reading a symbol allows them to have the
same computational power as a more conventional machine with storage (like
NVPAs).

We present some useful properties of NVPA and 1-AJA. The two models are
equi-expressive, and the sizes of the automata produced by the translations
between them are known.


Theorem 1 ([4]). For any NVPA $N$ of size $n$, there is a 1-AJA $A_N$ of size
$O(n^2)$, such that $L(A_N) = L(N)$. Conversely, for any 1-AJA $A$ of size $n$,
there is a NVPA $N_A$ of size $2^{O(n)}$, such that $L(N_A) = L(A)$.
Constructions can be carried out in time proportional to the size of the
resulting automaton.


Both 1-AJA and NVPAs are closed under language operations like complementation,
union and prefixing. We also recall the following result.


Theorem 2 ([1]). For NVPAs, the emptiness problem is PTIME-complete. 


4.2 Algorithm for sHCTL* 


Let us fix a pushdown system $\mathcal{P} = (S, \Gamma, s_{in}, \Delta, L)$ and
a sHCTL* sentence $\theta$. Our goal is to decide if
$\mathcal{P} \models \theta$. We will reduce this problem to checking the
emptiness of multiple NVPAs (Theorem 2). Our approach is similar to [10]: for
each cognate sub-formula $\psi$ (not necessarily a sentence) of $\theta$, we
will compositionally construct an automaton that accepts the path environments
satisfying $\psi$. Path environments will be encoded by strings over pushdown
alphabets as follows.
For a path $\sigma = c_0 c_1 c_2 \cdots$ of $[\![\mathcal{P}]\!]$, the trace of
$\sigma$, denoted $tr(\sigma)$, is the (unique) sequence
$(o_0, q_0, a_0, q_1)(o_1, q_1, a_1, q_2) \cdots$ such that for every
$i \in \mathbb{N}$,
$c_i \xrightarrow{(o_i, q_i, a_i, q_{i+1})} c_{i+1}$, where
$o_i \in \{call, int, ret\}$, $q_i, q_{i+1} \in S$, and
$a_i \in \Gamma \cup \{\varepsilon\}$ (see footnote 6).

While $tr(\sigma)$ is uniquely determined by the path $\sigma$, the converse is
not true: different paths may have the same trace. To see this, consider the
following example. For a configuration $c$ and $\gamma \in \Gamma^*$, let
$\gamma(c)$ denote the configuration $(state(c), stack(c)\gamma)$, i.e., the
configuration with the same control state, but with the stack containing the
symbols in $\gamma$ at the bottom. Observe that, for any $\gamma \in \Gamma^*$,
if $\sigma = c_0 c_1 c_2 \cdots$ is a path then so is
$\gamma(\sigma) = \gamma(c_0)\gamma(c_1)\gamma(c_2) \cdots$. Additionally,
$tr(\sigma) = tr(\gamma(\sigma))$. Two paths $\sigma_1$ and $\sigma_2$ of
$[\![\mathcal{P}]\!]$ will be said to be equivalent if
$tr(\sigma_1) = tr(\sigma_2)$, denoted $\sigma_1 \sim \sigma_2$. Observe that
equivalent paths have the same stack access pattern, i.e., if
$\sigma_1 \sim \sigma_2$ then $pr(\sigma_1) = pr(\sigma_2)$. The
semantics of sHCTL* does not distinguish between equivalent paths.
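
The claim that padding the bottom of the stack leaves the trace unchanged can be checked on finite prefixes with a short Python sketch. The names `trace` and `pad_bottom` and the sample prefix are hypothetical; configurations are `(state, stack)` pairs with the top of the stack first, and the empty string stands in for $\varepsilon$.

```python
from typing import List, Sequence, Tuple

Config = Tuple[str, Tuple[str, ...]]  # (state, stack), top of stack first
Label = Tuple[str, str, str, str]     # (tag, q, a, q'); a == "" stands for epsilon


def trace(prefix: Sequence[Config]) -> List[Label]:
    """Trace of a finite sequence whose stacks change by at most one push/pop."""
    labels = []
    for (q, before), (q_next, after) in zip(prefix, prefix[1:]):
        if after == before:
            labels.append(("int", q, "", q_next))
        elif len(after) == len(before) + 1 and after[1:] == before:
            labels.append(("call", q, after[0], q_next))   # pushed symbol
        elif len(before) == len(after) + 1 and before[1:] == after:
            labels.append(("ret", q, before[0], q_next))   # popped symbol
        else:
            raise ValueError("stacks differ by more than one push or pop")
    return labels


def pad_bottom(prefix: Sequence[Config], gamma: Sequence[str]) -> List[Config]:
    """Apply gamma(c) to every configuration: add gamma at the stack bottom."""
    return [(q, stack + tuple(gamma)) for q, stack in prefix]


sigma = [("p", ()), ("q", ("x",)), ("q", ("x",)), ("r", ())]
padded = pad_bottom(sigma, ("z", "z"))
```

Here `sigma` and `padded` are different sequences of configurations, yet `trace(sigma) == trace(padded)`, since a trace records only the control states and the symbol pushed or popped at each step.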


6 Observe that even when $\sigma$ is not a path in $[\![\mathcal{P}]\!]$ (i.e.,
does not correspond to an actual sequence of transitions of $\mathcal{P}$), the
trace of $\sigma$ is uniquely defined as long as stacks of successive
configurations of $\sigma$ can be obtained by leaving the stack unchanged, or
pushing/popping one symbol.
Proposition 1. Let $\psi$ be a cognate formula with $V$ as the set of free path
variables. Let $\Pi_1$ and $\Pi_2$ be two path environments such that for every
$\pi \in V$, $\Pi_1(\pi) \sim \Pi_2(\pi)$. Then,
$\mathcal{P}, \Pi_1, \pi' \models \psi$ if and only if
$\mathcal{P}, \Pi_2, \pi' \models \psi$.


The proof of Proposition 1 follows by induction on cognate formulas.
Proposition 1 establishes that the set of path environments satisfying a cognate
formula is a union of equivalence classes with respect to path equivalence.
Thus, instead of constructing automata that accept path environments, we will
construct automata that accept mappings from path variables to traces of paths.
For $m \in \mathbb{N}$, let
$\Sigma[m] = \Sigma[m]_{call} \cup \Sigma[m]_{int} \cup \Sigma[m]_{ret}$ be the
pushdown alphabet where $\Sigma[m]_{call} = \{call\} \times S^m \times \Gamma^m$,
$\Sigma[m]_{int} = \{int\} \times S^m \times \{\varepsilon\}^m$, and
$\Sigma[m]_{ret} = \{ret\} \times S^m \times \Gamma^m$. Observe that $\Sigma[0]$
is (essentially) the set $\{int, call, ret\}$.


Definition 3 (Encoding Path Environments). Consider a set of $m$ path
variables $V = \{\pi_1, \pi_2, \ldots, \pi_m\}$. A string
$w \in \Sigma[m]^\omega$ where for any $j \in \mathbb{N}$,
$w(j) = (o_j, (s^j_1, s^j_2, \ldots, s^j_m), (a^j_1, a^j_2, \ldots, a^j_m))$
encodes all path environments $\Pi$ such that

\[ \Pi(\dagger) = o_0 o_1 o_2 \cdots o_j \cdots \]
\[ tr(\Pi(\pi_i)) = (o_0, s^0_i, a^0_i, s^1_i)(o_1, s^1_i, a^1_i, s^2_i) \cdots \]

for any $i \in \{1, 2, \ldots, m\}$. The string encoding a path environment
$\Pi$ is denoted as $enc(\Pi)$ ($= w$, in this case).

Based on the definitions, the following observation about traces and encod- 
ings can be concluded. 


Proposition 2. For any path $\sigma \in Paths([\![\mathcal{P}]\!])$ and
$i \in \mathbb{N}$, $tr(\sigma[i:]) = tr(\sigma)[i:]$. For any path environment
$\Pi$ and $i \in \mathbb{N}$, $enc(\Pi[i:]) = enc(\Pi)[i:]$.


The encoding of path environments as strings over X[m] (for an appropriate 
value of m) is used in our decision procedure, which compositionally constructs 
automata that accept path environments satisfying each cognate formula. The 
size of our constructed automata, like in [10], will be a tower of exponentials that 
depends on the formula complexity of the cognate formula $\psi$. 


Definition 4 (Formula Complexity). The formula complexity of a SHCTL* 
formula $\varphi$, denoted $fc(\varphi)$, is inductively defined as follows. Let $odd : \mathbb{N} \to \mathbb{N}$ be the 
function that maps a number $n$ to the smallest odd number $\geq n$, i.e., $odd(n) = n$ 
if $n$ is odd and $odd(n) = n + 1$ if $n$ is even. Similarly, $even : \mathbb{N} \to \mathbb{N}$ maps $n$ 
to the smallest even number $\geq n$, i.e., $even(n) = odd(n + 1) - 1$. Below $\psi_1, \psi_2$ 
denote cognate formulas, and $\theta_1, \theta_2$ denote SHCTL* sentences. 


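\[
\begin{array}{ll}
fc(a_\pi) = 0 & fc(X\psi_1) = fc(\psi_1) \\
fc(\neg\psi_1) = even(fc(\psi_1)) & fc(\psi_1 \vee \psi_2) = \max(fc(\psi_1), fc(\psi_2)) \\
fc(\psi_1 \, U \, \psi_2) = even(\max(fc(\psi_1), fc(\psi_2))) & \\
fc(\exists\pi.\, \psi_1) = odd(fc(\psi_1)) & fc(E\psi_1) = odd(fc(\psi_1)) \\
fc(\neg\theta_1) = fc(\theta_1) & fc(\theta_1 \vee \theta_2) = \max(fc(\theta_1), fc(\theta_2))
\end{array}
\]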


Observe the difference in the definition of $fc(\neg\theta_1)$ and $fc(\neg\psi_1)$; for $\neg\theta_1$ there is 
no change in formula complexity, while for $\neg\psi_1$, we move to the next even level. 
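
The inductive definition of formula complexity can be sketched in a few lines of Python. This is only an illustrative sketch: the nested-tuple representation of formulas and the tag names (`"atom"`, `"not"`, `"snot"`, and so on) are assumptions made here and do not come from the paper.

```python
def odd(n: int) -> int:
    """Smallest odd number >= n."""
    return n if n % 2 == 1 else n + 1


def even(n: int) -> int:
    """Smallest even number >= n, defined as odd(n + 1) - 1."""
    return odd(n + 1) - 1


def fc(formula) -> int:
    """Formula complexity of a formula given as a nested tuple.

    Hypothetical encoding (not from the paper):
      ("atom", a, pi), ("not", psi), ("X", psi), ("or", psi1, psi2),
      ("U", psi1, psi2), ("exists", pi, psi), ("E", psi) for cognate parts,
      ("snot", theta) and ("sor", theta1, theta2) for sentences.
    """
    tag = formula[0]
    if tag == "atom":
        return 0
    if tag == "not":
        return even(fc(formula[1]))
    if tag == "X":
        return fc(formula[1])
    if tag == "or":
        return max(fc(formula[1]), fc(formula[2]))
    if tag == "U":
        return even(max(fc(formula[1]), fc(formula[2])))
    if tag in ("exists", "E"):
        return odd(fc(formula[-1]))
    if tag == "snot":
        return fc(formula[1])
    if tag == "sor":
        return max(fc(formula[1]), fc(formula[2]))
    raise ValueError(f"unknown formula tag: {tag!r}")


# One quantifier alternation: exists pi2. not exists pi1. a_pi1
inner = ("exists", "pi1", ("atom", "a", "pi1"))   # odd(0) = 1
example = ("exists", "pi2", ("not", inner))       # odd(even(1)) = 3
```

In this sketch, negating a sentence (`"snot"`) leaves the complexity unchanged, while negating a cognate formula (`"not"`) moves to the next even level, mirroring the observation above.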

Our main technical lemma is a compositional construction of an automaton 
for cognate formulas $\psi$. Depending on the parity of $fc(\psi)$, the automaton we 
construct will either be a 1-AJA or a NVPA. Before presenting this lemma, we 
define a function that is a tower of exponentials. For $c, k, n \in \mathbb{N}$, the value $g_c(k, n)$ 
is defined inductively on $k$ as follows: $g_c(0, n) = cn \log n$, and $g_c(k+1, n) = 
2^{g_c(k, n)}$. We use $g_{O(1)}(k, n)$ to denote the family of functions $\{g_c(k, n) \mid c \in \mathbb{N}\}$. 


Lemma 1. Consider pushdown system $\mathcal{P} = (S, \Gamma, s_{in}, \Delta, L)$ and SHCTL* sentence 
$\theta$. Let $\psi$ be a cognate subformula of $\theta$ with free path variables in the set 
$V = \{\pi_1, \ldots, \pi_m\}$ for $m \in \mathbb{N}$. We assume, without loss of generality, that the variables 
$\pi_1, \ldots, \pi_m$ are in the order in which they are quantified in $\theta$ with $\pi_m$ being 
the first free variable of $\psi$ that will be quantified in the context $\theta$. In addition, we 
assume that the size of both $\psi$ and $\mathcal{P}$ is bounded by $n$. There is an automaton 
$A_\psi$ over pushdown alphabet $\Sigma[m]$ such that for any path environment $\Pi$ over $V$, 


$\mathcal{P}, \Pi, \pi_m \models \psi$ if and only if $enc(\Pi) \in L(A_\psi)$.⁷ 


The automaton $A_\psi$ is a NVPA if $fc(\psi)$ is odd, and a 1-AJA if $fc(\psi)$ is even. 


The size of $A_\psi$ is at most $g_{O(1)}(\lceil fc(\psi)/2 \rceil, n)$.⁸ 


Before presenting the proof of Lemma 1, we would like to highlight a subtlety 
about its statement. The result guarantees that for valid path environments $\Pi$, 
encoding $enc(\Pi)$ is accepted by $A_\psi$ if and only if $\Pi$ satisfies $\psi$. It says nothing 
about path environments that are not valid. In particular, there may be functions 
that map path variables to traces that do not correspond to actual paths of $[\![\mathcal{P}]\!]$, 
but which are nonetheless accepted by $A_\psi$. Notice, however, when $\psi = \exists\pi.\, \psi_1$ is 
a cognate sentence, a string over $\{call, int, ret\}$ will, by conditions guaranteed in 
Lemma 1, be accepted if and only if it corresponds to a stack access pattern of 
a path from the initial state that satisfies $\exists\pi.\, \psi_1$. 


Proof (Sketch of Lemma 1). Our construction of $A_\psi$ will proceed inductively. 
The type of automaton constructed will be consistent with the parity of $fc(\psi)$, 
i.e., an NVPA if $fc(\psi)$ is odd and a 1-AJA if $fc(\psi)$ is even. We sketch the main 
ideas here, with the full proof in [2]. 

For $a_\pi$, $\neg\psi_1$, $\psi_1 \vee \psi_2$, and $X\psi_1$, the construction essentially proceeds by 
converting $A_{\psi_i}$ ($i \in \{1,2\}$), if needed, into the type (NVPA or 1-AJA) of the target 
automaton using Theorem 1, and then using standard closure properties to combine 
them to get the desired automaton. In case of $\psi = \psi_1 \, U \, \psi_2$, we first convert 
(if needed) $A_{\psi_i}$ ($i \in \{1,2\}$) into a 1-AJA. At each step, the automaton for $\psi$ 
will choose to either run $A_{\psi_2}$, or run $A_{\psi_1}$ and restart itself. Correctness relies 
on the fact that our encoding for path environments satisfies Proposition 2. 

The most interesting case is that of $\psi = \exists\pi.\, \psi_1$. We will first convert (if 
needed) the automaton for $\psi_1$ into a NVPA $A_1$. The automaton for $\psi$ will 
essentially guess the encoding of a path that is consistent with the transitions of 
$\mathcal{P}$, and check if assigning the guessed path to variable $\pi$ satisfies $\psi_1$ by running 
the automaton $A_1$. The additional requirement we have is that the guessed path 
start at the same configuration as the current configuration of the path assigned 
to variable $\pi_m$, which introduces some subtle challenges. In order to be able to 
guess a path, $A_\psi$ will keep track of $\mathcal{P}$'s control state in its control state, and use 
its stack to track $\mathcal{P}$'s stack operations along the guessed path. Since the stacks 
of all paths are synchronized, $A_\psi$ can use its (single) stack 
to track both the stack of $\mathcal{P}$ and the stack of $A_1$. 


7 When $m = 0$, we take $\pi_m$ to be $\dagger$. 
8 When the size of the specification $\psi$ is considered constant, the size of $A_\psi$ is at most 
$g_{O(1)}(\lceil fc(\psi)/2 \rceil - 1, n)$. 


Using Lemma 1, we can establish the main result of this section. 


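Theorem 3. Given a pushdown system $\mathcal{P} = (S, \Gamma, s_{in}, \Delta, L)$ and a SHCTL* sentence $\theta$, the 
problem of determining if $\mathcal{P} \models \theta$ is in $\bigcup_c DTIME(g_c(\lceil fc(\theta)/2 \rceil, n))$, where $n$ is a bound 
on the size of $\mathcal{P}$ and $\theta$. 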


Proof. Recall that a SHCTL* sentence is a Boolean combination of formulas of 
the form $E\psi$, where $\psi$ is a cognate sentence. Results on whether $\mathcal{P} \models E\psi$ for 
each such subformula can be combined to determine whether $\mathcal{P} \models \theta$. Given this, 
the time to determine if $\mathcal{P} \models \theta$ is at most the time to decide if $\mathcal{P}$ satisfies each 
subformula of the form $E\psi$ plus $O(n)$ (to compute the Boolean combination of 
these results). Next, recall that the construction in Lemma 1 ensures that for 
a cognate sentence of the form $\exists\pi.\, \psi$, $L(A_{\exists\pi.\psi})$ consists exactly of strings in 
$\{call, int, ret\}^\omega$ that encode a path environment over $\emptyset$ that satisfies $\exists\pi.\, \psi$. 
Consider a SHCTL* sentence $E\psi$. Let $\pi$ be a path variable that does not 
appear in the sentence $\psi$. Based on the semantics of SHCTL* the following 
observation holds: $\mathcal{P} \models E\psi$ if and only if for some path environment $\Pi$ over 
$\emptyset$, $\mathcal{P}, \Pi, \dagger \models \exists\pi.\, \psi$. This is equivalent to saying that $\mathcal{P} \models E\psi$ if and only if 
$L(A_{\exists\pi.\psi}) \neq \emptyset$. Since $fc(E\psi) = fc(\exists\pi.\, \psi)$, and the emptiness problem of NVPA 
can be decided in polynomial time (Theorem 2), our theorem follows. 


5 Lower Bound 


In this section, we establish a lower bound for the problem of model checking 
SHCTL* sentences against pushdown systems. Our proof establishes a hardness 
result for the SHLTL sub-fragment of SHCTL*. Before presenting this lower 
bound, we introduce the function $h_c(\cdot, \cdot)$, which is another tower of exponentials, 
inductively defined as follows: $h_c(0, n) = n$, and $h_c(k+1, n) = c^{h_c(k, n)}$. 


Theorem 4. Let $\mathcal{P}$ be a pushdown system and $\theta$ be a SHLTL sentence such that 
the sizes of both $\mathcal{P}$ and $\theta$ are bounded by $n$ and $fc(\theta) = 2k - 1$ for some $k \in \mathbb{N}$. 
The problem of checking if $\mathcal{P} \models \theta$ is $DTIME(h_c(k, n))$-hard, for every $c \in \mathbb{N}$. 


Proof (Sketch). We sketch the main intuitions behind the proof. To highlight the 
novelties of this proof, it is useful to recall how $NSPACE(h_c(k-1, n))$-hardness for 
HyperLTL model checking is proved [5]. The idea is to reduce the language of 
a nondeterministic $h_c(k-1, n)$ space bounded machine $M$ to the model checking 
problem by constructing a finite state transition system that guesses a run of 
$M$, and a HyperLTL formula that checks if the path is a valid accepting run. 

To get the stricter bound of $DTIME(h_c(k, n))$, we use the fact that we are 
checking pushdown systems. The stack of the pushdown system can be used 
to guess a tree, as opposed to a simple trace. Therefore, we reduce a $h_c(k-1, n)$ 
space bounded alternating Turing machine, instead of a nondeterministic 
machine. Since $ASPACE(f(n)) = DTIME(2^{O(f(n))})$ for $f(n) \geq \log n$, the theorem 
will follow if the reduction succeeds. 

Recall that a run of an alternating Turing machine $M$ is a rooted, labeled tree, 
where vertices are labeled by configurations of $M$ in a manner that is consistent 
with the transition function of $M$. To faithfully encode a tree as a sequence 
of symbols, we record the DFS traversal of the tree, making explicit the stack 
operations performed during such a traversal. Consider a labeled, rooted tree $T$ 
with root $r$ whose label is $\ell(r)$, with $T_1$ as the left sub-tree and $T_2$ as the right 
sub-tree. The DFS traversal of $T$ will push $\ell(r)$, traverse $T_1$ recursively, pop $\ell(r)$, 
push $\ell(r)$, traverse $T_2$, and then pop $\ell(r)$. We will use such a DFS traversal to 
guess and encode runs of $M$. Popping and pushing $\ell(r)$ between the traversals 
of $T_1$ and $T_2$ may seem redundant. Why not simply do nothing between the 
traversals of $T_1$ and $T_2$? For $T$ to be a valid run of $M$, the configuration labeling 
the root of $T_2$ must be the result of taking one step from $\ell(r)$. Such checks 
will be encoded in our SHLTL sentence, and for that to be possible, we need 
successive configurations of $M$ to be consecutive in the string encoding. 
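
The DFS traversal described above can be illustrated with a short Python sketch. The tree representation (a label paired with a list of children) and the treatment of leaves (a single push followed by a pop) are assumptions made for illustration; the sketch also ignores the additional trick of pushing a label together with its reverse.

```python
def dfs_encoding(tree):
    """Return the list of stack operations of the DFS traversal of `tree`.

    A tree is a pair (label, children). For every child, the label of the
    parent is pushed before the child is traversed and popped afterwards,
    so the parent label is adjacent to the label of each child's root.
    """
    label, children = tree
    if not children:
        # Assumption: a leaf contributes a single push/pop of its label.
        return [("push", label), ("pop", label)]
    operations = []
    for child in children:
        operations.append(("push", label))
        operations.extend(dfs_encoding(child))
        operations.append(("pop", label))
    return operations


def is_well_nested(operations):
    """Check that every pop matches the most recent unmatched push."""
    stack = []
    for kind, label in operations:
        if kind == "push":
            stack.append(label)
        elif not stack or stack.pop() != label:
            return False
    return not stack


# Root r with left sub-tree T1 (leaf a) and right sub-tree T2 (leaf b).
example_tree = ("r", [("a", []), ("b", [])])
```

For `example_tree`, the root label `r` is pushed and popped once around each sub-tree, which is exactly the seemingly redundant pop/push pair discussed in the text.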

To highlight some additional consistency checks, let us continue with our 
example tree $T$ from the previous paragraph. For a string to be a correct encoding 
of $T$, it is necessary that the string pushed before the traversal of $T_i$ ($i \in \{1,2\}$) 
be the same as the string popped after the traversal. This can be ensured by the 
pushdown system by actually pushing and popping those symbols. In addition, 
the string popped after $T_1$'s traversal must be the same as the string pushed 
before $T_2$'s traversal. Neither the stack nor the finite control of the pushdown 
system can be used to ensure this. Instead, this must be checked by the SHLTL 
sentence we construct. But the symbols popped for $\ell(r)$ will be in reverse 
order of the symbols being pushed, and it is challenging to perform this check 
in the formula. To overcome this, we push/pop the label and its reverse at the 
same time. This ensures that if we want to check if a string pushed is the same 
as a string that was just popped, then we can check for string equality, and this 
check is easier to do using formulas in SHLTL. Additional checks to ensure that 
the tree encodes a valid accepting run are performed by the SHLTL sentence 
using ideas from [17]. Full details can be found in [2]. 


6 Conclusions 


In this paper, we introduced a branching time temporal logic SHCTL* that can 
be used to specify synchronous hyperproperties for recursive programs modeled 
as pushdown systems. The primary difference from the standard branching time 
logic HyperCTL* for synchronous hyperproperties is that SHCTL* considers 
a restricted class of hyperproperties, namely, those that relate only executions 
that have the same stack access pattern. We call such hyperproperties stack-aware 
hyperproperties. We showed that the problem of model checking pushdown 
systems against SHCTL* specifications is decidable, and characterized its complexity. We 
also showed how this result can potentially be used to aid security verification. 

Abstract. Modern SAT solvers produce proofs of unsatisfiability to justify 
the correctness of their results. These proofs, which are usually represented 
in the well-known DRAT format, can often become huge, requiring 
multiple gigabytes of disk storage. We present a technique for semantic 
proof compression that selects a subset of important clauses from a proof 
and stores them as a so-called proof skeleton. This proof skeleton can 
later be used to efficiently reconstruct a full proof by exploiting paral- 
lelism. We implemented our approach on top of the award-winning SAT 
solver CaDiCaL and the proof checker DRAT-trim. In an experimental 
evaluation, we demonstrate that we can compress proofs into skeletons 
that are 100 to 5,000 times smaller than the original proofs. For almost 
all problems, proof reconstruction using a skeleton improves the solving 
time on a single core, and is around five times faster when using 24 cores. 


Keywords: SAT solving · proofs · compression. 


1 Introduction 


Solvers for the Boolean satisfiability problem (SAT) take as input a formula of 
propositional logic and decide if the formula is satisfiable. In case of satisfiability, 
they usually return an assignment of truth values to the variables of the formula; 
by plugging these truth values into the formula, users can easily convince them- 
selves that the solver was right and that the formula is indeed satisfiable. In 
case of unsatisfiability, however, things are more complicated: to justify their 
answer, solvers need to produce an independently checkable proof that none of 
the—exponentially many—potential truth assignments make the formula true. 
In practical SAT solving, proofs of unsatisfiability are represented in the 
DRAT format [10], and they are often huge, requiring several gigabytes (in some 
cases even terabytes [12] or petabytes [11]) of disk storage. Storing proofs is thus 
costly, especially since users might not require access to the proofs until sometime 
long after solving, at a point when proof verification or further analysis is desired. 
Up to now, the only options to deal with this problem were either to not 
store proofs and instead recompute them on demand—a laborious but plausible 
approach considering that proof checking typically takes longer than solving—or 
to use compression methods to reduce proof size. However, syntactic compression 
techniques (such as LZMA or DEFLATE, as supported by the ZIP file format) 
only provide moderate levels of compression. The same can be said about existing 
semantic compression techniques for proofs in SAT and SMT (cf. [4, 18, 21]), 
which only achieve 20% compression on average. 


In this paper, we present a novel approach to semantic compression that 
stores only a small subset of the clauses derived by a solver, called a proof 
skeleton. We can achieve strong compression rates with proof skeletons (around 
100 to 5,000 times smaller than the original proof), while still retaining enough 
information to allow for a quick on-demand reconstruction of a complete proof 
that might differ from the original proof. This is similar to how a mathematician 
might put down the most important reasoning steps of a proof in a proof sketch, 
enabling a moderately talented reader to fill in the gaps. In our case, the gaps can 
even be filled independently, meaning that multiple readers can work in parallel. 


We present both an online version (creating a proof skeleton during solv- 
ing) and an offline version (creating a proof skeleton from a full proof) of our 
approach. We select the clauses that end up in a proof skeleton by relying on 
several heuristics such as glue (a heuristic used internally by solvers to estimate 
the usefulness of clauses) for online and clause activity (a measure of how often a 
clause is used to derive new clauses) for offline. To reconstruct a full proof from a 
proof skeleton, we utilize multiple incremental SAT solvers that can run in parallel. 
We implemented all our algorithms on top of the award-winning SAT solver 
CaDiCaL [2] and the proof checker DRAT-trim [22]. In an extensive empirical 
evaluation, we demonstrate the feasibility of our approach, with all code and 
data available at https://github.com/amazon-science/unsat-proof-skeletons. 


Beyond being a tool for compression, proof skeletons can also serve as a 
source of insight into a solver’s reasoning. Getting any sort of intuition from a 
million-line proof is difficult; by computing a skeleton, we obtain a small set of 
facts—logically implied by the problem—that can give us an idea of how a solver 
established the unsatisfiability of a formula. This can lead to a feedback loop 
that improves solver performance. For example, when inspecting skeletons for 
some bounded-model-checking benchmarks, we observed many unit clauses and 
binary clauses of a certain type. From this, we hypothesized that the problems 
required more preprocessing, which did indeed improve performance. 


Our main contributions are as follows: (1) We present a semantic approach 
for proof compression that selects only the most important clauses of a proof. 
(2) We implemented an online version and an offline version of our approach on 
top of the SAT solver CaDiCaL and the proof checker DRAT-trim. (3) In an 
extensive empirical evaluation, we demonstrate that our approach can drastically 
reduce proof size while still enabling efficient proof reconstruction. 

The rest of this paper is structured as follows. In Section 2, we discuss back- 
ground required to understand our paper and review related work. In Section 3, 
we outline the main idea behind our proof-compression approach. In Section 4, 
we show multiple ways to create proof skeletons, and in Section 5 we show how 
to reconstruct full proofs from skeletons. Finally, in Section 6, we present an 
empirical evaluation of our approach before concluding in Section 7. 


2 Background and Related Work 


The Boolean satisfiability problem (SAT) takes as input a formula of propositional 
logic and asks if there exists a truth assignment under which the formula 
evaluates to true. As is common in SAT solving, we consider propositional 
formulas in conjunctive normal form (CNF), which are defined as follows. A literal 
is either a variable $x$ (a positive literal) or the negation $\bar{x}$ of a variable $x$ (a 
negative literal). The complement $\bar{l}$ of a literal $l$ is defined as $\bar{l} = \bar{x}$ if $l = x$ 
and as $\bar{l} = x$ if $l = \bar{x}$. For a literal $l$, we denote the variable of $l$ by $var(l)$. 
A clause is a finite disjunction of the form $(l_1 \vee \cdots \vee l_n)$, where $l_1, \ldots, l_n$ are 
literals. Clauses with only one literal are called unit clauses and clauses with two 
literals are called binary clauses. We denote the empty clause by $\bot$. A formula 
is a finite conjunction of the form $C_1 \wedge \cdots \wedge C_m$, where $C_1, \ldots, C_m$ are clauses. 
For example, $(x \vee \bar{y}) \wedge (z) \wedge (\bar{x} \vee \bar{z})$ is a formula consisting of the clauses $(x \vee \bar{y})$, 
$(z)$, and $(\bar{x} \vee \bar{z})$. 

A truth assignment (or assignment for short) is a function from a set of 
variables to the truth values 1 (true) and 0 (false). A literal $l$ is satisfied by an 
assignment $\alpha$ if $l$ is positive and $\alpha(var(l)) = 1$ or if $l$ is negative and $\alpha(var(l)) = 
0$. A literal $l$ is falsified by an assignment if its complement $\bar{l}$ is satisfied by the 
assignment. A clause $C$ is satisfied by an assignment $\alpha$ if $\alpha$ satisfies at least 
one of $C$'s literals. A formula $\psi$ is satisfied by an assignment $\alpha$ if $\alpha$ satisfies 
all of $\psi$'s clauses. A formula is satisfiable if there exists an assignment that 
satisfies it, otherwise it is unsatisfiable. A clause $C = (l_1 \vee \cdots \vee l_k)$ is implied 
by a formula $\psi$, denoted by $\psi \models C$, if all satisfying assignments of $\psi$ satisfy $C$, 
or equivalently, if $\psi \wedge \bar{C}$ is unsatisfiable, where $\bar{C} = (\bar{l}_1) \wedge \cdots \wedge (\bar{l}_k)$. In case a 
formula is satisfiable, modern solvers can output a satisfying assignment; in case 
the formula is unsatisfiable, most solvers can output a proof of unsatisfiability. 
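
As a small illustration of these definitions, the following Python sketch checks satisfaction and implication by brute-force enumeration of assignments. The integer encoding of literals (a positive integer for a variable, its negative for the negated variable) is an assumption borrowed from the common DIMACS convention, and the exhaustive search is only meant for tiny examples, not as a description of how solvers work.

```python
from itertools import product


def satisfies_clause(assignment, clause):
    """A clause is satisfied if at least one of its literals is satisfied."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)


def satisfies_formula(assignment, formula):
    """A formula is satisfied if all of its clauses are satisfied."""
    return all(satisfies_clause(assignment, clause) for clause in formula)


def all_assignments(formula):
    """Enumerate all truth assignments over the variables of `formula`."""
    variables = sorted({abs(lit) for clause in formula for lit in clause})
    for values in product([False, True], repeat=len(variables)):
        yield dict(zip(variables, values))


def is_satisfiable(formula):
    return any(satisfies_formula(a, formula) for a in all_assignments(formula))


def implies(formula, clause):
    """A clause is implied iff the formula plus the negated clause is unsatisfiable."""
    negated_units = [[-lit] for lit in clause]
    return not is_satisfiable(list(formula) + negated_units)


# Variables x = 1, y = 2, z = 3; clauses (x or not y), (z), (not x or not z).
example = [[1, -2], [3], [-1, -3]]
```

Here `implies(example, [-1])` holds because adding the unit clause for `x` makes the formula unsatisfiable, which matches the equivalent characterization of implication given above.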


Proofs of Unsatisfiability. State-of-the-art SAT solvers produce so-called clausal 
proofs. Intuitively, a clausal proof is a list of clause additions and clause deletions. 
Formally, a clausal proof is a list of pairs $(s_1, C_1), \ldots, (s_m, C_m)$, where for each 
$i \in 1, \ldots, m$, $s_i \in \{a, d\}$ and $C_i$ is a clause. If $s_i = a$, the pair is called an 
addition, and if $s_i = d$, it is called a deletion. For a given input formula $\psi_0$, a 
clausal proof gives rise to accumulated formulas $\psi_i$ ($i \in 1, \ldots, m$) as follows: 

\[
\psi_i = \begin{cases} \psi_{i-1} \cup \{C_i\} & \text{if } s_i = a \\ \psi_{i-1} \setminus \{C_i\} & \text{if } s_i = d \end{cases}
\]

The clauses of an accumulated formula $\psi_i$ are also called the active clauses 
at point $i$. Clause additions must preserve satisfiability, which is usually guaranteed 
by requiring the added clauses to fulfill some efficiently decidable syntactic 
criterion that itself implies satisfiability is preserved. Deletions are unrestricted 
and are not useful for proving unsatisfiability as they only make a formula “more 
satisfiable”; their main purpose is to speed up proof checking by keeping the set 
of active clauses small. A valid proof of unsatisfiability must end with the addi- 
tion of the empty clause. As the empty clause is trivially unsatisfiable, and since 
all proof steps preserve satisfiability, the unsatisfiability of the original formula 
can then be concluded. 

Clausal proof systems are distinguished by the syntactic criterion they impose 
on clause additions. The standard SAT solving paradigm conflict-driven clause 
learning (CDCL) [15,16] adds so-called RUP (short for reverse unit propagation) 
clauses [20], whose definition is based on the notion of unit propagation. Unit 
propagation is the process of repeatedly applying the unit-clause rule to a formula 
until no unit clauses are left. Given a formula $\psi$, the unit-clause rule takes 
a unit clause $(l)$ and makes its literal $l$ true, meaning that (1) all clauses that 
contain $l$ are removed from $\psi$, and (2) the negation $\bar{l}$ of $l$ is removed from all 
remaining clauses. If unit propagation produces the empty clause, we say it derived 
a conflict. For example, unit propagation derives a conflict on $(x) \wedge (\bar{x} \vee y) \wedge (\bar{x} \vee \bar{y})$ 
as the application of the unit-clause rule for $(x)$ produces the formula $(y) \wedge (\bar{y})$, 
on which another application of the unit-clause rule, with either of $(y)$ or $(\bar{y})$, 
produces the empty clause. If unit propagation derives a conflict on a formula, 
the formula is clearly unsatisfiable, but not vice versa. 

A clause $C = (l_1 \vee \cdots \vee l_k)$ is a RUP for a formula $\psi$ if unit propagation 
derives a conflict on $\psi \wedge \bar{C}$. If $C$ is a RUP for $\psi$, it is implied by $\psi$ since $\psi \wedge \bar{C}$ 
is unsatisfiable; we thus sometimes write $\psi \vdash_1 C$ to denote that $C$ is a RUP 
for $\psi$. The clausal proof system allowing the addition of RUP clauses together 
with deletions is called DRUP. Solvers participating in the SAT competition 
must produce DRAT proofs, but since each DRUP proof is also a DRAT proof 
(but not vice versa) and since all state-of-the-art solvers actually produce DRUP 
proofs by default, we restrict this study of proof compression to DRUP proofs. 
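
To make unit propagation, the RUP criterion, and DRUP proofs concrete, the following Python sketch implements a naive checker. It is a minimal illustration under stated assumptions: literals are encoded as nonzero integers (an assumption following the DIMACS convention), accumulated formulas are kept as a set of clauses, and the check for the final empty clause is simplified to testing whether the empty clause is active at the end. It is not how an efficient checker such as DRAT-trim is implemented.

```python
def unit_propagate(clauses):
    """Return True if unit propagation derives a conflict on `clauses`."""
    clauses = [set(clause) for clause in clauses]
    while True:
        if any(len(clause) == 0 for clause in clauses):
            return True  # the empty clause was produced
        unit = next((clause for clause in clauses if len(clause) == 1), None)
        if unit is None:
            return False  # no unit clauses are left
        (lit,) = unit
        # Unit-clause rule: drop clauses containing lit, remove -lit elsewhere.
        clauses = [clause - {-lit} for clause in clauses if lit not in clause]


def is_rup(formula, clause):
    """C is a RUP for the formula if propagation on formula + not(C) conflicts."""
    negated_units = [[-lit] for lit in clause]
    return unit_propagate(list(formula) + negated_units)


def check_drup(formula, proof):
    """Check a list of ("a", clause) / ("d", clause) steps against `formula`."""
    active = {frozenset(clause) for clause in formula}
    for step, clause in proof:
        clause = frozenset(clause)
        if step == "a":
            if not is_rup(active, clause):
                return False  # added clause is not a RUP for the active clauses
            active.add(clause)
        elif step == "d":
            active.discard(clause)  # deletions are unrestricted
        else:
            raise ValueError(f"unknown proof step: {step!r}")
    return frozenset() in active


# (x) and (not x or y) and (not x or not y), with x = 1 and y = 2.
propagation_example = [[1], [-1, 2], [-1, -2]]
# All four binary clauses over x and y: unsatisfiable, but without unit clauses.
no_units_example = [[1, 2], [1, -2], [-1, 2], [-1, -2]]
```

For `no_units_example`, unit propagation alone finds no conflict, so the empty clause is not a RUP for the input formula; first adding the unit clause for `x` (which is a RUP) makes the subsequent addition of the empty clause valid.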

A proof checker is an independent tool that verifies the correctness of proofs. 
There exist formally verified proof checkers that provide strong correctness guarantees 
(cf. [5,9,14,19]). Because these tools are inefficient, proofs are often 
passed through an efficient but unverified intermediary proof checker (such 
as DRAT-trim [22]) that transforms a DRAT proof into a so-called LRAT 
proof [5]. The resulting LRAT proof includes additional information (called 
hints), which allows a formally verified checker to efficiently check the proof. 


3 Problem Overview 


We want to compress proofs into small representations that can be efficiently 
decompressed into full proofs. Existing techniques for SAT and SMT focus 
on transformations and substitutions that preserve validity to generate smaller 
proofs [4,18,21]. We achieve greater compression by storing only a so-called proof 
skeleton, which itself is not a valid proof. 
Tools like Sledgehammer [3] automatically solve proof obligations from 
interactive theorem provers, filling gaps in the proof by translating lower-level 
reasoning into the theorem provers’ logic. More recent work proposed a method 
for constructing proofs for complex SMT rewriting steps on demand in a post- 
processing step [17]. In a similar way, we use proof skeletons to efficiently recon- 
struct valid proofs that can differ from the original proofs. 

Suppose you solved an unsatisfiable CNF formula $\psi$, and out of the many 
facts you learned during solving, there were three facts $A$, $B$, and $C$, which you 
deem particularly important for showing the unsatisfiability of $\psi$. You can then 
build a proof skeleton from $A$, $B$, and $C$. Later, you can rephrase the question 
$\psi \models \bot$ (“does $\psi$ imply the empty clause?”, or equivalently, “is $\psi$ unsatisfiable?”) 
into the following questions: 


$\psi \models A$ \qquad $\psi \wedge A \models B$ \qquad $\psi \wedge A \wedge B \models C$ \qquad $\psi \wedge A \wedge B \wedge C \models \bot$ 


Not only do $A$, $B$, and $C$ provide a way to partition the proof effort; when 
ordered carefully, they can also be used as assumptions in subsequent questions. 
Each question can be submitted to a solver independently, and combining the 
four resulting proofs will give a proof of the original claim that $\psi$ is unsatisfiable. 

Our work translates this general schema to the realm of SAT by (1) deter- 
mining which learned clauses from a SAT solver are most useful and should be 
stored in a proof skeleton; (2) carefully grouping solver calls to prevent repeated 
work when producing partial proofs from a proof skeleton; and (3) stitching the 
partial proofs together to generate a complete proof. 


Determining which clauses are stored in a proof skeleton. We co-opt the clause- 
importance metrics used by CDCL solvers. We give a brief overview of these 
metrics in the following. CDCL solvers make progress by continuously learning 
new clauses that help them prune the search space of possible truth assignments. 
To limit memory usage, they occasionally perform a clause database reduction, 
removing a large portion of learned clauses based on some usefulness heuristics. 
Most solvers keep clauses that are short, have low glue value, are reason clauses, 
or have been used recently. The glue of a clause (also known as its literal block 
distance, or LBD) is a positive integer that estimates the usefulness of a clause. 
Intuitively, a low glue value means that few decisions are required to falsify the 
clause, which is considered good. For a more extensive discussion of glue, we 
refer to the respective literature [1]. A reason clause is a clause that was used by 
the solver when performing unit propagation, meaning that the clause became 
a unit clause under a partial assignment. The number of times a reason clause 
is used during conflict analysis is considered the clause’s activity. 


Grouping solver calls for partial proofs. We leverage incremental SAT to con- 
struct partial proofs. An incremental SAT solver solves a problem with several 
related steps, with the solver retaining state (e.g., learned clauses and heuris- 
tics) between steps; it also allows solving under so-called assumptions, which are 
literals assumed to be true in a step. Solving a sequence of related steps incrementally 
is often much faster than solving them independently of each other (for 
more details on incremental SAT see, e.g., [6]). 

Given a formula $\psi$ and a sequence $C_1, \ldots, C_n$ of clauses, we want to produce 
a DRUP proof of $\psi \models C_i$ for each $i \in 1, \ldots, n$. We use an incremental solver 
to produce partial proofs, with each solving step corresponding to a clause $C_i$. 
For the first step, $\psi \models C_1$, we pass the assumptions $\bar{C}_1 = \bar{l}_1 \wedge \cdots \wedge \bar{l}_k$ to the 
incremental solver. Given the formula $\psi$, the solver assigns the literals in the 
assumptions, then runs the CDCL algorithm until it derives the empty clause. 
During solving, CDCL guarantees that all learned clauses are RUPs for the 
input formula $\psi$. Let $\phi_1$ denote the sequence of clauses learned by the solver. 
Then, since unit propagation under the assumptions $\bar{l}_1 \wedge \cdots \wedge \bar{l}_k$ derived the 
empty clause, $C_1$ is by definition a RUP for $\psi \wedge \phi_1$. This means that $C_1$ can be 
appended to the corresponding proof of the solver (which derives all clauses in 
$\phi_1$) to obtain a valid DRUP derivation of $C_1$ from $\psi$. 

In the next step, the clause $C_2$ is handled similarly, except the solver retains 
the learned clauses $\phi_1 \wedge C_1$ when proving that $C_2$ is a RUP clause. This continues 
until all $n + 1$ steps corresponding to the $n$ clauses of the proof skeleton are 
completed (step $n + 1$ corresponds to the derivation of the empty clause). 

To parallelize this reasoning, we use an approach akin to divide-and-conquer 
techniques established in parallel SAT solving [13]. Divide-and-conquer solvers 
first partition a problem into multiple subproblems and then solve the subprob- 
lems in parallel. Similarly, we divide the incremental solver steps into so-called 
chunks, which are independent groups of subsequent solver steps. For example, 
we can split the solver steps into one chunk containing the first half of steps 
and another chunk containing the second half of steps. Both chunks can then be 
solved in parallel by two independent incremental SAT solvers. 
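
The division of solver steps into chunks can be sketched as follows in Python. This is a hedged illustration only: the function names, the dictionary layout, and the choice of near-equal contiguous chunks are assumptions, and the sketch additionally assumes (in the spirit of the schema above) that each chunk may take the skeleton clauses preceding it as given premises. No actual SAT solver is invoked.

```python
def split_into_chunks(steps, num_chunks):
    """Split `steps` into contiguous chunks of near-equal size.

    Returns a list of (start_index, chunk) pairs; empty chunks are skipped.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be positive")
    size, extra = divmod(len(steps), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        end = start + size + (1 if i < extra else 0)
        if end > start:
            chunks.append((start, steps[start:end]))
        start = end
    return chunks


def chunk_queries(skeleton, num_chunks):
    """Describe the work of each chunk for a skeleton C_1, ..., C_n.

    The n + 1 solver steps are the skeleton clauses followed by the empty
    clause. Each chunk records its target clauses and the skeleton clauses
    before it, which it may assume (an assumption of this sketch).
    """
    steps = list(skeleton) + [[]]  # [] stands for the empty clause
    return [
        {"premises": steps[:start], "targets": targets}
        for start, targets in split_into_chunks(steps, num_chunks)
    ]


# A skeleton with three unit clauses, split across two workers.
example_queries = chunk_queries([[1], [2], [3]], 2)
```

Each resulting dictionary could then be handed to an independent incremental solver, for instance through a process pool; how the work is actually scheduled is left open here.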


Stitching partial proofs together. Once we have partial proofs for all $n+1$ solving 
steps, a full proof of unsatisfiability can be constructed as the sequence of clause 
additions arising from $\phi_1, C_1, \phi_2, C_2, \ldots, C_n, \phi_{n+1}, \bot$, where $\phi_i$ is the sequence 
of clauses learned by the $i$-th solver step, as explained above. In general, clauses 
are added and deleted during solving, so the proof can be augmented with the 
deletion information contained in the proofs emitted by a solver. However, we need 
to ensure clauses are not deleted in the proof and then implicitly reintroduced 
into a solver, which can occur when inprocessing techniques touch variables in 
the assumptions. We use variable freezing [7] to freeze all variables occurring in 
$C_1, \ldots, C_n$; this avoids any unsound inprocessing [8], and is required to ensure 
correctness of the proofs. 
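
The stitching order can be written down directly. The Python sketch below interleaves the learned clauses of each step with the skeleton clause that step proves, ending with the empty clause. The function name and the `("a", clause)` step format are assumptions for illustration, and deletion information is deliberately left out.

```python
def stitch(partial_proofs, skeleton):
    """Combine partial proofs into one list of clause additions.

    `partial_proofs[i]` holds the clauses learned in solver step i + 1 and
    `skeleton` holds C_1, ..., C_n, so n + 1 partial proofs are expected.
    The result follows the order phi_1, C_1, ..., C_n, phi_{n+1}, empty clause.
    """
    if len(partial_proofs) != len(skeleton) + 1:
        raise ValueError("expected one partial proof per skeleton clause plus one")
    targets = list(skeleton) + [[]]  # [] stands for the empty clause
    proof = []
    for learned, target in zip(partial_proofs, targets):
        proof.extend(("a", list(clause)) for clause in learned)
        proof.append(("a", list(target)))
    return proof


# Skeleton with one clause (x); step 1 learns (x or y), step 2 learns nothing.
example_proof = stitch([[[1, 2]], []], [[1]])
```

In this toy instance the stitched proof first adds the learned clause, then the skeleton clause, and finally the empty clause.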


4 Creating Proof Skeletons 


Given a clausal proof $P = (s_1, C_1), \ldots, (s_m, C_m)$, we define a proof skeleton of $P$ 
to be a sequence of clauses obtained from clause additions in $P$. Ideally, a skeleton 
is small but contains enough useful clauses to guide reasoning during proof 
reconstruction. A proof skeleton can be constructed online, during the solver’s 
execution, by applying a filter to clauses as they are traced to a proof. Alterna- 
tively, a proof skeleton can be constructed offline, after solving, by processing 
the full proof and selecting important clauses. 


4.1 Online Generation of Proof Skeletons 


We create proof skeletons online by filtering clause additions as the solver traces 
them to a proof. Clauses that pass a usefulness threshold are added to the 
skeleton. As mentioned earlier, the filter applies usefulness heuristics from CDCL 
including glue and clause activity. Additionally, at certain intervals we add reason 
clauses to the skeleton. We implemented the filter within the solver CaDiCaL, 
giving us access to these values as well as to the reason clauses (through the trail 
of assignments). We also enabled logging, giving every clause a unique identifier, 
in order to sort the skeletons. We evaluate three different configurations: 


— GLUE: Clauses with glue lower than 3. 

— GLUE+TRAIL: Clauses with glue lower than 3, and all reason clauses on the 
trail before each clause-database reduction. 

— DYNAMIC: Clauses with glue lower than some dynamically adjusted thresh- 
old glue,, and all reason clauses on the trail every 50,000 learned clauses. 


The first two configurations combine low-glue clauses with either no or some 
reason clauses. Increasing the glue value threshold often led to a compression of 
less than 1,000 times and slower reconstruction. Reason clauses are important 
because they are actively used by the solver whereas for low-glue clauses this 
is not guaranteed (although low glue is associated with high usage in general). 
Clause-database reductions are sparse, so reason clauses (which are added only 
during these reductions) will be added infrequently. We evaluate the impact of 
including reason clauses in the skeletons in Section 6.3. 

In the first two configurations, all clauses passing the filter are accepted into 
the skeleton. For some formulas, a solver will produce many low-glue clauses 
and the skeleton will become too large, and for others too few low-glue clauses 
will lead to a small skeleton. Our third configuration accounts for the differences 
between formulas by adjusting heuristics dynamically to meet a desired
compression ratio. The heuristics are updated based on the number of clauses added
to the skeleton within some number of conflicts, denoted as window. For a
compression ratio between 500 and 1,000, and a window value of 5,000, we tuned
the DYNAMIC configuration in the following way: every 5,000 conflicts, if more
than 25 (window/200) lemmas passed the filter, the glue threshold is decreased,
and if fewer than 3 lemmas (window/2,000) passed the filter, the glue threshold is
increased. Reasons from the trail are added every 50,000 conflicts (window × 10).
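
The update rule for the DYNAMIC configuration can be sketched as below. The window size and the two bounds (25 and 3) come from the text; the step size of 1, the initial threshold of 3, and the clamping range are assumptions made for this example, and the class and method names are hypothetical.

```python
import math


class DynamicGlueFilter:
    """Adjusts the glue threshold once per window of conflicts."""

    def __init__(self, window=5000, glue_limit=3, min_limit=2, max_limit=10):
        self.glue_limit = glue_limit           # assumed starting threshold
        self.min_limit = min_limit             # assumed clamping range
        self.max_limit = max_limit
        self.upper = window // 200             # 25 for window = 5,000
        self.lower = math.ceil(window / 2000)  # 3 for window = 5,000
        self.passed = 0                        # lemmas accepted in this window

    def accept(self, glue):
        """Return True if a lemma with this glue passes the current filter."""
        ok = glue < self.glue_limit
        if ok:
            self.passed += 1
        return ok

    def end_of_window(self):
        """Call once every `window` conflicts to adapt the threshold."""
        if self.passed > self.upper:
            self.glue_limit = max(self.min_limit, self.glue_limit - 1)
        elif self.passed < self.lower:
            self.glue_limit = min(self.max_limit, self.glue_limit + 1)
        self.passed = 0


glue_filter = DynamicGlueFilter()
for _ in range(30):            # 30 low-glue lemmas in one window: too many
    glue_filter.accept(2)
glue_filter.end_of_window()
tightened = glue_filter.glue_limit
glue_filter.end_of_window()    # an empty window: too few, so relax again
relaxed = glue_filter.glue_limit
print(tightened, relaxed)
```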

For configurations using reason clauses, the unique clause IDs are used to 
sort the skeleton. This is necessary because reason clauses are traced during 
reductions, so they may initially appear in the skeleton long after they were 
learned by the solver. During proof reconstruction it is important that clauses 
appear in the skeleton in an order that corresponds with a solver’s reasoning. 
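
A small sketch of the sorting step follows. It assumes that the unique clause IDs increase in the order in which the solver learned the clauses; the function name and the tuple layout are hypothetical.

```python
def order_skeleton(traced_entries):
    """Sort skeleton entries by clause ID and drop repeated IDs.

    traced_entries: iterable of (clause_id, literals) in the order in which
    the clauses were traced to the skeleton.
    """
    unique = {}
    for clause_id, literals in traced_entries:
        unique.setdefault(clause_id, literals)  # keep the first occurrence
    return sorted(unique.items())


# Clause 4 was learned first but traced last, during a reduction.
traced = [(7, (1, 2)), (9, (-3,)), (4, (5, -6)), (7, (1, 2))]
ordered = order_skeleton(traced)
print(ordered)
```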

We implemented additional configurations using clause activities. For this, 
we incremented an activity field for each clause every time it was used during 
conflict analysis. An evaluation of these additional configurations is beyond the 
scope of this paper, but data can be found in the paper’s repository. 


4.2 Offline Generation of Proof Skeletons 


We create proof skeletons offline by processing a full proof and selecting the 
most active clauses. Given a DRAT proof, the tool DRAT-TRIM uses backwards 
checking to generate an optimized LRAT proof and, optionally, an UNSAT core 
(i.e., an unsatisfiable subset of the original formula). From the LRAT proof, 
we can estimate a clause’s activity by counting the number of times the clause 
appears in a hint of a clause-addition step. We then add the clauses with the 
highest activity to the skeleton until a target compression ratio is met. We found
that for most problems, a target ratio of 1,000 provided the best reconstruction performance.
We sort the skeleton by each clause’s first use as a hint in the LRAT proof, 
signifying when a clause is actually used as opposed to when it is learned. We 
evaluate three configurations for offline generation: 


— OFFLINE: Select 1,000 times fewer clauses than in the original DRAT proof. 

— OFFLINE+UNITS: Additionally include all unit clauses from the proof. 

— OFFLINE-OPT: Select 1,000 times fewer clauses than in the optimized LRAT 
proof. 


The motivation for OFFLINE-OPT is that some optimized LRAT proofs have 
significantly fewer clauses than the DRAT proofs, resulting from many unused 
lemmas, which suggests that stronger compression is possible. 
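
The offline selection can be sketched as follows, assuming the LRAT addition steps have already been parsed into `(clause_id, literals, hints)` tuples. The function name, the tuple layout, and the restriction to lemmas with at least one use are assumptions made for this illustration; it is not the tool's actual code.

```python
from collections import Counter


def select_offline_skeleton(lrat_steps, num_original, clause_count, ratio=1000):
    """Pick the most active lemmas and order them by first use as a hint.

    lrat_steps:   addition steps (clause_id, literals, hints) in proof order
    num_original: clauses with IDs 1..num_original belong to the formula
    clause_count: number of clauses the ratio refers to (DRAT proof for
                  OFFLINE, optimized LRAT proof for OFFLINE-OPT)
    """
    activity = Counter()
    first_use = {}
    for index, (_cid, _lits, hints) in enumerate(lrat_steps):
        for hint in hints:
            ref = abs(hint)  # LRAT marks RAT candidates with negative IDs
            activity[ref] += 1
            first_use.setdefault(ref, index)

    lemmas = {cid: lits for cid, lits, _ in lrat_steps if cid > num_original}
    budget = max(1, clause_count // ratio)
    used = [cid for cid in lemmas if activity[cid] > 0]
    chosen = sorted(used, key=lambda cid: (-activity[cid], cid))[:budget]
    chosen.sort(key=lambda cid: first_use[cid])
    return [(cid, lemmas[cid]) for cid in chosen]


steps = [
    (4, (1,), [1, 2]),
    (5, (2,), [4, 3]),
    (6, (3,), [4, 5]),
    (7, (), [4, 6, 5]),
]
selected = select_offline_skeleton(steps, num_original=3, clause_count=2, ratio=1)
print(selected)
```

With a budget of two clauses, lemmas 4 and 5 are kept because they appear in the most hints; the empty clause is never used as a hint and is therefore not selected.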

Offline construction requires expensive post-processing with DRAT-TRIM. 
However, during online construction we can only guess the future usefulness 
of clauses when they are derived, by relying on heuristics such as glue, but we 
cannot know how often a clause will actually be used. For instance, it may be that 
a clause has low glue (predicting high usefulness) but is learned and then never 
used in the rest of the proof, making it worthless in the skeleton. In contrast, 
when constructing a skeleton offline—after solving—we know already how often 
the clause was actually used in reasoning throughout the proof, and whether it 
was used to derive the empty clause. Also, we can use the UNSAT core instead 
of the original formula when reconstructing a proof for the original problem. 


5 Reconstructing Proofs from Skeletons 


We reconstruct proofs by filling the gaps of a proof skeleton with a SAT solver. 
Once we have proofs for all gaps, we stitch them together with the clauses of the 
skeleton to create a complete proof. We can utilize information obtained during 
proof reconstruction to further shrink skeletons by removing less useful clauses. 
Finally, we can also use a skeleton to create an optimized LRAT proof. 

[Figure 1: panels "Proof", "Skeleton", "Reconstruction", and "Incremental Reconstruction"]


Fig. 1. Proof reconstruction from a proof skeleton and a formula $\varphi$ by filling in the
gaps between skeleton clauses. This can be done with independent SAT calls or with
an incremental SAT solver that keeps learned clauses ($\rho_i$) between steps.


5.1 Filling Skeletons Using Incremental Solvers 


We consider two ways of filling a proof skeleton’s gaps: reconstruction and
incremental reconstruction; both are illustrated in Fig. 1. Given a formula $\varphi$ and
a skeleton $C_1, \dots, C_n$, reconstruction fills each gap $\varphi \land C_1 \land \dots \land C_{i-1} \vdash C_i$
using independent SAT solver calls, with $\varphi \land C_1 \land \dots \land C_n \vdash \bot$ as the final
call. Filling a gap for $C_i = (l_1 \lor \dots \lor l_k)$ involves assuming $\lnot l_1 \land \dots \land \lnot l_k$ and
deriving the empty clause with proof $\rho_i$, which proves that $C_i$ is a RUP for
$\varphi \land C_1 \land \dots \land C_{i-1} \land \rho_i$. Each gap has an associated DRUP proof $\rho_i$ emitted
by the solver. Since RUP is a monotonic property, the clauses added in $\rho_i$ will
not affect the validity of $\rho_j$ for $i < j$. However, clause deletions could make
the proof $\rho_1, (a, C_1), \rho_2, (a, C_2), \dots, (a, C_n), \rho_{n+1}, \bot$ incorrect. For example, if a
skeleton clause $C_1$ is deleted in $\rho_2$, then $\rho_3$ (stemming from $\varphi \land C_1 \land C_2 \vdash C_3$)
may use $C_1$, a clause already deleted in the proof. The same problem could
occur if formula clauses are deleted. Therefore, we must remove any deletion steps
for clauses of the skeleton or of the formula clauses from each $\rho_i$.
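
The sequence of independent solver calls can be sketched as follows. The brute-force `is_unsat` function is only a stand-in for a real SAT solver and is usable on tiny examples alone; all names are hypothetical, and clauses are lists of DIMACS-style integer literals.

```python
from itertools import product


def negate(clause):
    """Assumptions for a gap: the negation of every literal of the clause."""
    return [-lit for lit in clause]


def gap_queries(formula, skeleton):
    """Yield (clauses, assumptions) for each gap, ending with the final call."""
    available = list(formula)
    for clause in skeleton:
        yield list(available), negate(clause)
        available.append(clause)  # later gaps may use this skeleton clause
    yield list(available), []     # final call: derive the empty clause


def is_unsat(clauses, assumptions):
    """Exhaustive check standing in for a SAT solver call."""
    constraints = [list(c) for c in clauses] + [[lit] for lit in assumptions]
    variables = sorted({abs(lit) for c in constraints for lit in c})
    for bits in product([False, True], repeat=len(variables)):
        value = dict(zip(variables, bits))
        if all(any(value[abs(lit)] == (lit > 0) for lit in c) for c in constraints):
            return False  # found a satisfying assignment
    return True


formula = [[1, 2], [-1, 2], [1, -2], [-1, -2]]
skeleton = [[2]]
results = [is_unsat(cs, assume) for cs, assume in gap_queries(formula, skeleton)]
print(results)
```

Each query that comes back unsatisfiable corresponds to one filled gap; a real solver would additionally emit the DRUP proof for that gap.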

The second approach, incremental reconstruction, uses an incremental SAT
solver, which allows the use of learned clauses when filling subsequent gaps.
Specifically, we create an incremental problem with the steps assume($C_1$), ...,
assume($C_n$), assume($\bot$), where each step assume($C_i$), with $C_i = (l_1 \lor \dots \lor l_k)$,
involves assuming $\lnot l_1 \land \dots \land \lnot l_k$ and deriving the empty clause. Each step produces a
proof $\rho_i$, and the complete proof $\rho_1, (a, C_1), \rho_2, (a, C_2), \dots, (a, C_n), \rho_{n+1}, (a, \bot)$
is correct as long as variables occurring in skeleton clauses are frozen (as
described in Section 3). With this approach, we no longer need to worry about
deletions of skeleton clauses or formula clauses because the solver fills each gap
using the current clause database, i.e., each gap is proved without clauses
formerly deleted by the solver.

To parallelize incremental reconstruction, we partition the incremental problem
into several independent incremental problems, which we call chunks. We
assign k clauses $C_i, \dots, C_{i+k-1}$ from the skeleton to each chunk, and we then use
an incremental solver to compute partial proofs for each of the clauses, starting
from the formula $\varphi \land C_1 \land \dots \land C_{i-1}$. For each partial proof corresponding to
a clause $C_j$, we call the solver with the assumptions negating the clause, i.e.,
with assume($C_j$). Again, we must remove any deletion steps of skeleton clauses
or formula clauses since they may be used in later chunks. All added clauses are
then RUPs, and so the concatenation of chunk proofs is a complete proof.
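
Removing the problematic deletion steps from a chunk proof can be sketched as follows, assuming the proof is available as lines in the plain-text DRAT format, where a deletion step is written as `d` followed by the literals and a terminating `0`. The function names are hypothetical, and clauses are compared as sets of literals.

```python
def clause_key(literals):
    """Clauses are compared as sets, so literal order does not matter."""
    return frozenset(literals)


def strip_protected_deletions(drat_lines, protected_clauses):
    """Drop deletion steps that delete a skeleton clause or formula clause."""
    protected = {clause_key(clause) for clause in protected_clauses}
    kept = []
    for line in drat_lines:
        tokens = line.split()
        if tokens and tokens[0] == "d":
            literals = [int(tok) for tok in tokens[1:] if tok != "0"]
            if clause_key(literals) in protected:
                continue  # this clause may be needed by a later chunk
        kept.append(line)
    return kept


chunk_proof = ["1 2 0", "d 2 -3 0", "d 1 2 0", "0"]
protected = [[-3, 2]]  # a skeleton or formula clause
cleaned = strip_protected_deletions(chunk_proof, protected)
print(cleaned)
```

Deletions of clauses that were learned inside the chunk, such as `d 1 2 0` here, stay in the proof.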

Each chunk can be solved independently in parallel. The more skeleton 
clauses in each chunk, the more clauses the incremental solver can learn and 
reuse in subsequent steps. However, gaps might differ in hardness, meaning that 
some gaps can be filled quickly while others require a significant amount of 
solving time. A chunk can thus become a bottleneck during parallelization if 
it includes many difficult gaps. In our evaluation, we partitioned the skeleton 
into chunks of equal size, one for each core. For instance, on a single core, one 
incremental problem spanning the entire skeleton was given to a solver instance 
whereas for 24 cores, the skeleton was partitioned into 24 chunks. In principle, 
we could partition a skeleton into more chunks than cores, but this would require 
an intermediary level of problem scheduling that we leave for future work. 
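
The even partitioning used in the evaluation can be sketched as follows. The function name and the returned `(prefix_length, clauses)` pairs are assumptions made for this illustration; `prefix_length` is the number of earlier skeleton clauses that are added to the formula before the chunk is solved.

```python
def partition_into_chunks(skeleton, num_chunks):
    """Split the skeleton into contiguous chunks of (nearly) equal size."""
    base, extra = divmod(len(skeleton), num_chunks)
    chunks = []
    start = 0
    for index in range(num_chunks):
        size = base + (1 if index < extra else 0)
        if size == 0:
            continue  # fewer skeleton clauses than requested chunks
        chunks.append((start, skeleton[start:start + size]))
        start += size
    return chunks


skeleton = list(range(10))  # stand-ins for ten skeleton clauses
chunks = partition_into_chunks(skeleton, 3)
print(chunks)
```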


5.2 Shrinking Skeletons 


The runtimes for filling each gap of a proof skeleton could provide insight into 
the usefulness of the skeleton clauses. For example, if the solver can quickly fill 
a gap, the corresponding skeleton clause may be trivially implied, and if the 
solver takes long, the clause may be useful since its derivation requires a lot of 
reasoning. Alternatively, the difference in runtime might not be explained by 
clause usefulness. Take, for example, the two gaps $\varphi \vdash C_2$ and $\varphi \land C_2 \vdash C_5$
from Fig. 1, and assume that the solver fills the first gap in a millisecond and the
second gap in ten seconds. If the difference is a result of $C_2$ being trivially implied,
it makes sense to remove $C_2$ from the skeleton; otherwise, if the difference is
due to factors unrelated to usefulness, it is better to remove $C_5$. Based on this
observation, we try to shrink a given skeleton by sorting gap reconstruction times 
and removing a certain share of the slowest or fastest clauses. 
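
The shrinking step can be sketched as follows, assuming one measured gap time per skeleton clause. The function name, the `drop` parameter, and the rounding of the share are hypothetical choices for this example.

```python
def shrink_skeleton(skeleton, gap_times, share, drop="fastest"):
    """Remove a share of the clauses whose gaps were filled fastest or slowest."""
    if len(skeleton) != len(gap_times):
        raise ValueError("need exactly one gap time per skeleton clause")
    order = sorted(range(len(skeleton)), key=lambda i: gap_times[i])
    count = int(len(skeleton) * share)
    if drop == "fastest":
        removed = set(order[:count])
    else:
        removed = set(order[len(order) - count:])
    # Keep the surviving clauses in their original skeleton order.
    return [clause for i, clause in enumerate(skeleton) if i not in removed]


skeleton = ["C2", "C5", "C7", "C9"]
gap_times = [0.001, 10.0, 0.5, 3.0]  # seconds
without_fastest = shrink_skeleton(skeleton, gap_times, 0.5, drop="fastest")
without_slowest = shrink_skeleton(skeleton, gap_times, 0.25, drop="slowest")
print(without_fastest, without_slowest)
```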

Our empirical evaluation in Section 6 indicates that removing the fastest 
clauses is the right approach for improving compression and (sometimes) re- 
ducing reconstruction time. Even though gap runtime and clause usefulness are 
correlated, the correlation is not perfect. For instance, sometimes the incremen- 
tal solver is able to quickly fill a gap because of learning from previous steps of 
the incremental problem. Even if it takes a long time to fill a gap, there is no 
guarantee that the corresponding skeleton clause is useful for filling future gaps. 
We examine in detail how shrinking skeletons affects reconstruction time. 


5.3 Reconstructing LRAT Proofs from Skeletons 


The proof reconstruction described above will produce DRAT proofs. Formally 
verified checkers typically require LRAT proofs, forcing a conversion via a proof 
checker such as DRAT-TRIM, which can take much longer than the original 
solving time. Instead, we can reconstruct DRAT proofs for each chunk, then 
convert the DRAT proofs to LRAT in parallel, and finally concatenate them. 

We use DRAT-TRIM to convert chunk DRAT proofs to LRAT. This required
us to modify DRAT-TRIM (e.g., by changing the way it performs backwards
checking, and how it handles unit clauses). By default, DRAT-TRIM starts
backwards checking at the empty clause. However, only the last chunk derives the
empty clause, and we must also ensure that all skeleton clauses are included in
the backwards check, as they may be used in later chunks. To account for this,
we mark each skeleton clause in the DRAT proof before performing the backwards
check. The backwards check verifies that each marked clause is RAT (or
RUP, in our case), including the marked clauses in the LRAT proof. When combining the
chunk LRAT proofs, we map the skeleton clauses in each chunk to the index of 
the LRAT step where they were initially added. Finally, we remove all deletions 
from the LRAT proof, but this will not affect proof-checking time, mainly since 
LRAT checkers perform unit propagation in linear time using hints. While the 
following evaluation focuses on DRAT proof reconstruction from skeletons, we 
tested our implementation of parallel LRAT proof reconstruction on 24 cores, 
and verified several proofs with CAKE-LPR [19]. 


6 Experimental Evaluation 


We evaluated our approach on SAT Competition 2021 Main Track benchmarks,
using all (65) unsatisfiable formulas that were solved between 500 and 5,000 sec- 
onds by the solver CADICAL [2]. By requiring at least 500 seconds of solving 
time, we ensured that proofs are of reasonable size (around 1 GB) and there- 
fore good candidates for compression. We ran experiments on an AWS EC2 
md5d.metal instance, with 96 virtual CPUs and 500 GB of memory, running at 
most 24 parallel processes at a time. We used a timeout of 5,000 seconds for 
solving a problem and constructing a DRAT proof. For proof reconstruction on 
a single core we used a single incremental problem spanning the entire skeleton. 
For proof reconstruction on 24 cores, we evenly divided the proof skeleton into 24 
incremental problems (chunks) passed to 24 instances of CADICAL. We report 
real time for proof reconstruction, not including skeleton extraction. 


6.1 Single-Core Proof Reconstruction 


Fig. 2 shows the best configurations on each formula using online skeletons (left) 
and offline skeletons (right), for the single-core experiments (i.e., the entire skele- 
ton on a single core). Almost all proofs were reconstructed faster than the orig- 
inal solving time (below the red dotted line), and in some cases more than five 
times faster (below the blue dotted line). Each configuration was the best for 
some formulas. The GLUE configuration led the online skeletons. With a single 
incremental problem, learned clauses from earlier incremental calls can be kept 
for the entire execution, meaning that clauses that occur later in large skeletons 
(e.g., GLUE+TRAIL) may be trivially implied by previously learned clauses. 

Fig. 2. Runtimes (in seconds) of best online (left) and offline (right) configurations for
proof reconstruction using a proof skeleton and a single core.

Fig. 3. Proof skeleton compression ratio for online (left) and offline (right).


6.2 Skeleton Compression Ratio 


Fig. 3 shows the sorted compression ratios (w.r.t. file size) between proof skele- 
tons and the original DRAT proofs for each configuration as well as the com- 
pression ratios for the configuration with the fastest reconstruction time on each 
formula (Best). For online configurations (left), the DYNAMIC skeletons have the 
most consistent compression ratios, with a tradeoff in reconstruction times. In 
some cases, skeletons can have higher compression (10,000 times) without a loss 
in performance, witnessed by the right-hand-side tail of the plot. 

For offline configurations (right), OFFLINE selects 1/1,000 of the clauses from
the original DRAT proof. The ratios are much greater than 1,000 because skeletons
have no deletion information and the most active clauses are typically much
shorter than the average clause. OFFLINE-OPT provides around a factor 10 more
compression, and these smaller skeletons provide faster reconstruction for about
half of the formulas. In general, the compression is much better when using clause
activity as a measure for clause importance as opposed to online heuristics (such
as glue), with similar reconstruction times seen in Fig. 2.

Fig. 4. Runtimes (in seconds) for proof reconstruction of multiple online configurations
with a single core (left) and 24 cores (right).


6.3 Impact of Reason Clauses in Online Skeletons 


Fig. 4 shows a comparison of reconstruction times between the GLUE and the 
GLUE+TRAIL online configurations, both on a single core (left) and on 24 cores 
(right). On a single core, creating skeletons with only low-glue clauses performs 
better than creating skeletons with low-glue clauses and reasons from the trail. 
On multiple cores, however, the reason clauses are beneficial for many reconstruc- 
tions. This may be because for parallel reconstruction, each individual chunk only 
has access to lemmas earlier in the skeleton during solving. Therefore, having 
more clauses in the skeleton will aid the later chunks. In contrast, for a single 
chunk on one core, learned clauses are kept throughout solving, and these learned 
clauses supplement the smaller skeletons. 


6.4 Impact of the UNSAT Core on Offline Skeletons 


Fig. 5 shows the effect of using an UNSAT core during reconstruction for offline
skeletons on a single core (left) and on 24 cores (right). For the experiments
using an UNSAT core, we remove formula clauses that are not in the UNSAT
core before passing the formula to the solver during the incremental SAT call for
the chunk proof. Using the UNSAT core greatly improves performance during
reconstruction on a single core. This may be because the skeleton is built from
reasoning based on the UNSAT core, so focusing the solver on these specific
formula clauses makes filling the gaps in the skeleton easier. The UNSAT core is
useful in parallel reconstruction as well, producing the overall best configuration
between online and offline skeletons. To give an idea, it takes approximately 125
KB to store an UNSAT core as a bit vector (each bit indicating whether or not
a clause is part of the core) for a formula with one million clauses. For most
formulas, this data would be dominated by the size of the proof skeleton.

342 J. E. Reeves et al.

[Figure 5: scatter plots with panels "Single Core" and "24 Cores"; axis label "OFFLINE"]

Fig. 5. Runtimes (in seconds) for OFFLINE proof reconstruction with and without an
UNSAT core with a single core (left) and 24 cores (right).
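
The size estimate for the UNSAT-core bit vector can be checked with a short sketch: one million clauses need one million bits, i.e., 125,000 bytes. The helper names and the 1-based clause IDs are assumptions made for this example.

```python
def core_bitvector(num_clauses, core_ids):
    """Encode an UNSAT core as one bit per formula clause (IDs are 1-based)."""
    bits = bytearray((num_clauses + 7) // 8)
    for clause_id in core_ids:
        index = clause_id - 1
        bits[index // 8] |= 1 << (index % 8)
    return bits


def in_core(bits, clause_id):
    """Return True if the clause with this ID is marked as part of the core."""
    index = clause_id - 1
    return bool(bits[index // 8] & (1 << (index % 8)))


bits = core_bitvector(1_000_000, [1, 8, 9])
print(len(bits))  # size in bytes
```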


6.5 Skeleton Shrinking after Reconstruction 


We discussed in Section 5.2 that it might make sense to shrink a skeleton by 
removing some amount of the fastest or of the slowest skeleton clauses. Fig. 6 
shows results for reconstruction on 24 cores using the DYNAMIC online skeleton, removing
either the fastest 90% or the slowest 10% of clauses. To perform the shrinking, 
we performed proof reconstruction from the skeleton and measured the solve 
times for the incremental calls, with each call corresponding to a skeleton clause. 
Removing the fastest 90% has a small impact on reconstruction time, performing 
slower for the majority of formulas. In some cases, shrinking the skeleton even 
improves performance because redundant or unnecessary clauses are removed 
from the skeleton. Removing the slowest solved clauses causes a wider variation 
in reconstruction time. This might be because these clauses are important for 
guiding the solver during reconstruction, and sometimes they lead the solver into 
unprofitable search regions that waste time. This shows two things: (1) For some 
formulas, removing only a fraction of clauses from the skeleton can lead to a big 
or small improvement, and (2) skeleton clauses are mostly nontrivial and cannot 
be added or removed randomly without a potentially consequential impact. 

Fig. 6. Runtimes (in seconds) of proof reconstruction on 24 cores after skeleton shrinking
for the DYNAMIC online configuration, removing the fastest 90% (left) or the slowest
10% (right) of clauses from the skeleton.

Fig. 7. Left: Runtimes (in seconds) of original solver on a single core against proof
reconstruction on 24 cores with the best offline-skeleton configuration OFFLINE+UNITS
using UNSAT cores. Right: Runtimes (in seconds) of parallel SAT solvers MALLOB
and ILINGELING without proof logging against proof reconstruction with the best
offline skeleton configurations using an UNSAT core, each using 24 cores.


6.6 Comparison With Sequential and Parallel SAT Solvers 


Alternatives to our proof reconstruction could be to compute a proof on demand 
by solving a formula from scratch (either with a sequential or with a parallel SAT 
solver) or to run a parallel incremental solver that fills the gaps of a skeleton. 
The left plot of Fig. 7 shows the difference between running a sequential solver 
on a single core versus running our parallel proof reconstruction on 24 cores. For 
the majority of formulas, parallel proof reconstruction is over five times faster, 
and in some cases closer to ten times faster. One formula had little improvement 
for reconstruction (on the red dotted line). For this formula, the final chunk took 
around 2,000 seconds to solve, and the next slowest chunk took only 24 seconds, 
meaning the hardest gaps were all clustered in the final chunk. For these sorts of 
problems, a smaller chunk size could break up the hard gaps, therefore improving 
utilization across cores and reducing the reconstruction time. 
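
The effect of such a clustered chunk on core utilization can be estimated with a small calculation. Only two chunk times are reported for this formula (about 2,000 seconds and 24 seconds for the next slowest chunk); the sketch below assumes that all 23 remaining chunks took the full 24 seconds, so the result is an upper bound rather than a measured value.

```python
def utilization(chunk_times):
    """Fraction of total core time spent working until the last chunk ends."""
    makespan = max(chunk_times)  # wall-clock time with one chunk per core
    return sum(chunk_times) / (len(chunk_times) * makespan)


# One chunk of about 2,000 s; the other 23 chunks bounded by 24 s each.
chunk_times = [2000.0] + [24.0] * 23
bound = utilization(chunk_times)
print(f"utilization is at most {bound:.1%}")
```

Under this assumption, the 24 cores are busy for at most roughly 5% of the reconstruction time, which illustrates why smaller chunks and scheduling could help for such formulas.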

To our knowledge, there exist no portfolio solvers or parallel incremental 
solvers that produce proofs. However, it might be possible to add proof support 
to solvers like MALLOB (a clause-sharing portfolio solver) or ILINGELING (a 
parallel incremental solver); we thus compare our approach to these solvers in 
the right plot of Fig. 7. 

The comparison to MALLOB suggests that some form of clause sharing be- 
tween solvers that solve independent chunks may improve performance. This 
could be achieved with forward clause sharing, where learned clauses can only 
be sent to solvers running on subsequent chunks. Also, MALLOB has full core 
utilization by running each solver until one derives the empty clause, but our 
proof reconstruction does not since some chunks take longer than others. With 
smaller chunk sizes and good scheduling, proof reconstruction could get closer 
to full utilization. 

ILINGELING, which is based on LINGELING [2], takes an incremental problem 
and greedily assigns steps to solver instances, terminating when one instance 
derives the empty clause. There is no clause sharing between solvers. We ran 
ILINGELING using the incremental problem derived from the proof skeleton. In 
proof reconstruction, chunks can use skeleton clauses from previous chunks, lead- 
ing to consistently better performance than ILINGELING. 


7 Conclusion 


We presented a semantic approach for compressing propositional proofs by se- 
lecting important clauses that summarize the reasoning of a solver. We store 
these clauses in a so-called proof skeleton, from which we can reconstruct a com- 
plete proof in parallel by performing multiple incremental SAT solver calls. We 
implemented our approach on top of the SAT solver CADICAL and the proof 
checker DRAT-TRIM. In an empirical evaluation, we showed that our approach 
can produce skeletons that are 100 to 5,000 times smaller than the original proofs. 
On a single core, almost all proofs were reconstructed faster than the original 
solving time, and when using 24 cores, the majority of proofs were reconstructed
around five times faster. This is significant since proof checking typically takes 
longer than solving, and since existing parallel solvers cannot produce proofs 
while maintaining strong performance. We observed that proof skeletons not 
only serve as a compression mechanism but also provide insight into a problem. 
In future work, we thus plan to explore the connection between skeletons, proofs, 
and solver performance. 


Propositional Proof Skeletons 345 


References 


1. Audemard, G., Simon, L.: Predicting learnt clauses quality in modern SAT solvers.
In: Boutilier, C. (ed.) IJCAI 2009, Proceedings of the 21st International Joint
Conference on Artificial Intelligence, Pasadena, California, USA, July 11-17, 2009.
pp. 399-404 (2009), http://ijcai.org/Proceedings/09/Papers/074.pdf

2. Biere, A., Fazekas, K., Fleury, M., Heisinger, M.: CaDiCaL, Kissat, Paracooba,
Plingeling and Treengeling entering the SAT Competition 2020. In: Balyo, T.,
Froleyks, N., Heule, M., Iser, M., Järvisalo, M., Suda, M. (eds.) Proc. of SAT
Competition 2020 - Solver and Benchmark Descriptions. Department of Computer
Science Report Series B, vol. B-2020-1, pp. 51-53. University of Helsinki (2020)

3. Blanchette, J.C., Böhme, S., Paulson, L.C.: Extending sledgehammer with SMT
solvers. J. Autom. Reason. 51(1), 109-128 (2013)

4. Boudou, J., Fellner, A., Paleo, B.W.: Skeptik: A proof compression system. In:
Demri, S., Kapur, D., Weidenbach, C. (eds.) Automated Reasoning - 7th International
Joint Conference, IJCAR 2014, Held as Part of the Vienna Summer of Logic,
VSL 2014, Vienna, Austria, July 19-22, 2014. Proceedings. Lecture Notes in
Computer Science, vol. 8562, pp. 374-380. Springer (2014),
https://doi.org/10.1007/978-3-319-08587-6_29

5. Cruz-Filipe, L., Heule, M.J.H., Hunt Jr., W.A., Kaufmann, M., Schneider-Kamp, P.:
Efficient certified RAT verification. In: de Moura, L. (ed.) Automated Deduction -
CADE 26 - 26th International Conference on Automated Deduction, Gothenburg,
Sweden, August 6-11, 2017, Proceedings. Lecture Notes in Computer Science, vol.
10395, pp. 220-236. Springer (2017), https://doi.org/10.1007/978-3-319-63046-5_14

6. Eén, N., Sörensson, N.: An extensible SAT-solver. In: Giunchiglia, E., Tacchella,
A. (eds.) Theory and Applications of Satisfiability Testing, 6th International
Conference, SAT 2003. Santa Margherita Ligure, Italy, May 5-8, 2003 Selected
Revised Papers. Lecture Notes in Computer Science, vol. 2919, pp. 502-518. Springer
(2003), https://doi.org/10.1007/978-3-540-24605-3_37

7. Eén, N., Sörensson, N.: Temporal induction by incremental SAT solving. Electron.
Notes Theor. Comput. Sci. 89(4), 543-560 (2003),
https://doi.org/10.1016/S1571-0661(05)82542-3

8. Fazekas, K., Biere, A., Scholl, C.: Incremental inprocessing in SAT solving. In:
Janota, M., Lynce, I. (eds.) Theory and Applications of Satisfiability Testing -
SAT 2019 - 22nd International Conference, SAT 2019, Lisbon, Portugal, July 9-12,
2019, Proceedings. Lecture Notes in Computer Science, vol. 11628, pp. 136-154.
Springer (2019), https://doi.org/10.1007/978-3-030-24258-9_9

9. Heule, M., Hunt Jr., W.A., Kaufmann, M., Wetzler, N.: Efficient, verified checking of
propositional proofs. In: Ayala-Rincón, M., Muñoz, C.A. (eds.) Interactive Theorem
Proving - 8th International Conference, ITP 2017, Brasilia, Brazil, September
26-29, 2017, Proceedings. Lecture Notes in Computer Science, vol. 10499, pp. 269-284.
Springer (2017), https://doi.org/10.1007/978-3-319-66107-0_18

10. Heule, M.J.H.: The DRAT format and drat-trim checker. CoRR abs/1610.06229
(2016), http://arxiv.org/abs/1610.06229

11. Heule, M.J.H.: Schur number five. In: McIlraith, S.A., Weinberger, K.Q. (eds.)
Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence,
(AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18),
and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence
(EAAI-18). pp. 6598-6606. AAAI Press (2018),
https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16952

12. Heule, M.J.H., Kullmann, O., Marek, V.W.: Solving and verifying the boolean
pythagorean triples problem via cube-and-conquer. In: Creignou, N., Le Berre, D.
(eds.) Theory and Applications of Satisfiability Testing - SAT 2016. pp. 228-245.
Springer International Publishing, Cham (2016)

13. Heule, M.J.H., Kullmann, O., Wieringa, S., Biere, A.: Cube and conquer: Guiding
CDCL SAT solvers by lookaheads. In: Eder, K., Lourenço, J., Shehory, O.
(eds.) Hardware and Software: Verification and Testing. pp. 50-65. Springer Berlin
Heidelberg, Berlin, Heidelberg (2012)

14. Lammich, P.: Efficient verified (UN)SAT certificate checking. J. Autom. Reason.
64(3), 513-532 (2020), https://doi.org/10.1007/s10817-019-09525-z

15. Marques-Silva, J.P., Sakallah, K.A.: GRASP: A search algorithm for propositional
satisfiability. IEEE Trans. Computers 48(5), 506-521 (1999),
https://doi.org/10.1109/12.769433

16. Moskewicz, M.W., Madigan, C.F., Zhao, Y., Zhang, L., Malik, S.: Chaff: Engineering
an efficient SAT solver. In: Proceedings of the 38th Design Automation
Conference, DAC 2001, Las Vegas, NV, USA, June 18-22, 2001. pp. 530-535. ACM
(2001), https://doi.org/10.1145/378239.379017

17. Nötzli, A., Barbosa, H., Niemetz, A., Preiner, M., Reynolds, A., Barrett, C., Tinelli,
C.: Reconstructing fine-grained proofs of rewrites using a domain-specific language.
In: Griggio, A., Rungta, N. (eds.) Formal Methods in Computer-Aided Design -
22nd Conference, FMCAD 2022, Trento, Italy, October 17-21, 2022, Proceedings.
pp. 65-74. Formal Methods in Computer-Aided Design, TU Wien Academic Press
(2022)

18. Rollini, S.F., Bruttomesso, R., Sharygina, N., Tsitovich, A.: Resolution proof
transformation for compression and interpolation. Formal Methods Syst. Des. 45(1),
1-41 (2014), https://doi.org/10.1007/s10703-014-0208-x

19. Tan, Y.K., Heule, M.J.H., Myreen, M.O.: cake_lpr: Verified propagation redundancy
checking in CakeML. In: Groote, J.F., Larsen, K.G. (eds.) Tools and Algorithms
for the Construction and Analysis of Systems - 27th International Conference,
TACAS 2021, Held as Part of the European Joint Conferences on Theory
and Practice of Software, ETAPS 2021, Luxembourg City, Luxembourg, March 27 -
April 1, 2021, Proceedings, Part II. Lecture Notes in Computer Science, vol. 12652,
pp. 223-241. Springer (2021), https://doi.org/10.1007/978-3-030-72013-1_12

20. Van Gelder, A.: Verifying RUP proofs of propositional unsatisfiability.
In: International Symposium on Artificial Intelligence and Mathematics,
ISAIM 2008, Fort Lauderdale, Florida, USA, January 2-4, 2008 (2008),
http://isaim2008.unl.edu/PAPERS/TechnicalProgram/ISAIM2008_0008_60a1f9b2fd607a61lec9e0feac3f438f8.pdf

21. Vyskočil, J., Stanovský, D., Urban, J.: Automated proof compression by invention
of new definitions. In: Clarke, E.M., Voronkov, A. (eds.) Logic for Programming,
Artificial Intelligence, and Reasoning - 16th International Conference, LPAR-16,
Dakar, Senegal, April 25-May 1, 2010, Revised Selected Papers. Lecture Notes
in Computer Science, vol. 6355, pp. 447-462. Springer (2010),
https://doi.org/10.1007/978-3-642-17511-4_25

22. Wetzler, N., Heule, M.J.H., Hunt, W.A.: DRAT-trim: Efficient checking and trimming
using expressive clausal proofs. In: Theory and Applications of Satisfiability
Testing (SAT). LNCS, vol. 8561, pp. 422-429 (2014)




Open Access This chapter is licensed under the terms of the Creative Commons 
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), 
which permits use, sharing, adaptation, distribution and reproduction in any medium 
or format, as long as you give appropriate credit to the original author(s) and the 
source, provide a link to the Creative Commons license and indicate if changes were 
made. 

The images or other third party material in this chapter are included in the 
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the 
material. If material is not included in the chapter’s Creative Commons license and 
your intended use is not permitted by statutory regulation or exceeds the permitted 
use, you will need to obtain permission directly from the copyright holder. 




Unsatisfiability Proofs for 
Distributed Clause-Sharing SAT Solvers 


Dawn Michaelson², Dominik Schreiber³, Marijn J. H. Heule¹,⁴,
Benjamin Kiesl-Reiter¹, and Michael W. Whalen¹,²


1 Amazon Web Services, Seattle, USA 
2 University of Minnesota, Minneapolis, USA 
micha576@umn.edu
3 Karlsruhe Institute of Technology, Karlsruhe, Germany 
dominik.schreiber@kit.edu 
4 Carnegie Mellon University, Pittsburgh, USA 


Abstract. Distributed clause-sharing SAT solvers can solve problems 
up to one hundred times faster than sequential SAT solvers by shar- 
ing derived information among multiple sequential solvers working on 
the same problem. Unlike sequential solvers, however, distributed solvers 
have not been able to produce proofs of unsatisfiability in a scalable man- 
ner, which has limited their use in critical applications. In this paper, 
we present a method to produce unsatisfiability proofs for distributed 
SAT solvers by combining the partial proofs produced by each sequen- 
tial solver into a single, linear proof. Our approach is more scalable and 
general than previous explorations for parallel clause-sharing solvers, al- 
lowing use on distributed solvers without shared memory. We propose a 
simple sequential algorithm as well as a fully distributed algorithm for 
proof composition. Our empirical evaluation shows that for large-scale 
distributed solvers (100 nodes of 16 cores each), our distributed approach 
allows reliable proof composition and checking with reasonable overhead. 
We analyze the overhead and discuss how and where future efforts may 
further improve performance. 


Keywords: SAT solving · proofs · distributed computing.


1 Introduction 


SAT solvers are general-purpose tools for solving complex computational prob- 
lems. By encoding domain problems into propositional logic, users have suc- 
cessfully applied SAT solvers in various fields such as formal verification [31], 
automated planning [25], and mathematics [8,16]. The list of applications has 
grown significantly over the years, mainly because algorithmic improvements 
have led to orders of magnitude improvement in the performance of the best 
sequential solvers (see, e.g., [21] for a comparison). 

Despite all this progress, there are still many problems that cannot be solved 
quickly with even the best sequential solvers, pushing researchers to explore 
ways of parallelizing SAT solving. One approach that has worked well for specific 
problem instances is Cube-and-Conquer [17,18], which can achieve near-linear 
speedups for thousands of cores but requires domain knowledge about how to split
a problem effectively into subproblems. An alternative approach that does
not require such knowledge is clause-sharing portfolio solving, which has recently
led to solvers [12,28] achieving impressive speedups (10x to 100x on a 100x16-core
cluster) over the best sequential solvers across broad sets of benchmarks.⁵

Although distributed solvers are demonstrably the most powerful tools for 
solving hard SAT problems, there is an important caveat: unlike sequential 
solvers, current distributed clause-sharing solvers cannot produce proofs of un- 
satisfiability. While there has been foundational work in producing proofs for 
shared-memory clause-sharing SAT solvers [14], existing approaches are neither 
scalable nor general enough for large-scale distributed solvers. This is not just a 
theoretical problem—for four problems in the 2020 and 2021 SAT competitions, 
distributed solvers produced incorrect answers that were not discovered until the 
2022 competition because they could not be independently verified.⁶

In this paper, we deal with this issue and present the first scalable approach 
for generating proofs for distributed SAT solvers. To construct proofs, we main- 
tain provenance information about shared clauses in order to track how they 
are used in the global solving process, and we use the recently-developed LRAT 
proof format [9] to track dependencies among partial proofs produced by solver 
instances. By exploiting these dependencies, we are then able to reconstruct a 
single linear proof from all the partial proofs produced by the sequential solvers. 
We first present a simple sequential algorithm for proof reconstruction before 
devising a parallel algorithm that can even be implemented in a distributed way. 
Both algorithms produce independently-verifiable proofs in the LRAT format. 
We demonstrate our approaches using an LRAT-producing version of the se- 
quential SAT solver CaDiCaL [5] to turn it into a clause-sharing solver, and 
then modify the distributed solver Mallob [28] to orchestrate a portfolio of such 
CaDiCaL instances while tracking the IDs of all shared clauses. 

We conduct an evaluation of our approaches from the perspective of efficiency, 
benchmarking the performance of our clause-sharing portfolio solver against the 
winners of the cloud track, parallel track, and sequential track from the SAT 
Competition 2022. Adding proof support introduces several kinds of overhead 
for clause-sharing portfolios in terms of solving, proof reconstruction, and proof 
checking, which we examine in detail. We show that even with this overhead, dis- 
tributed solving and proving is much faster than the best sequential approaches. 
We also demonstrate that our approach dramatically outperforms previous work 
on proof production for clause-sharing portfolios [14]. We argue that much of the 
overhead of our current setup can be compensated, among other measures, by 
improving support for LRAT in solver backends. We thus hope that our work 
provides an impetus for researchers to add LRAT support to other solvers. 

Our main contributions are as follows: 


5 E.g., the SAT Competition 2022 results:
https://satcompetition.github.io/2022/downloads/sc2022-detailed-results.zip

6 The incorrectly scored problems were SAT_MS_sat_nurikabe_p08.pdd1_71.cnf, 
randomG-Mix-n18-d05.cnf, phpi2e12.cnf, and Cake_9_20.cnf. 
– We present the first effective and scalable approach for proof generation in
distributed SAT solving.

– We implement our approach on top of the state-of-the-art solvers CaDiCaL
and Mallob.

– We perform a large-scale empirical evaluation analyzing the overhead introduced
by proof production as compared to state-of-the-art portfolios.

– We demonstrate that our approach dramatically outperforms previous work
in parallel proof production, and that it remains substantially more scalable
than the best sequential solvers.


The rest of this paper is structured as follows. In Section 2, we present the 
background required to understand the rest of our paper and discuss related 
work. In Section 3, we describe the general problem of producing proofs for 
distributed SAT solving and a simple algorithm for proof combination. In Sec- 
tion 4, we describe a much more efficient distributed version of our algorithm 
before discussing implementation details in Section 5. Finally, we present the 
results of our empirical evaluation in Section 6 and conclude with a summary 
and an outlook for future work in Section 7. 


2 Background and Related Work 


The Boolean satisfiability problem (SAT) asks whether a Boolean formula can 
be satisfied by some assignment of truth values to its variables. An overview can 
be found in [6]. We consider formulas in conjunctive normal form (CNF). As 
such, a formula F is a conjunction (logical “AND”) of disjunctions (logical “OR”) 
of literals, where a literal is a Boolean variable or its negation. For example, 
(¬a ∨ b ∨ c) ∧ (b ∨ ¬c) ∧ (a) is a formula with variables a, b, c and three clauses.
A truth assignment A maps each variable to a Boolean value (true or false). A 
formula F is satisfied by an assignment A if F evaluates to true under A, and 
F is satisfiable if such an assignment exists. Otherwise, F is called unsatisfiable. 

If a formula F is found to be satisfiable, modern SAT solvers commonly 
output a truth assignment; users can easily evaluate F under the assignment in 
linear time to verify that F is indeed satisfiable. In contrast, if a formula turns 
out unsatisfiable, sequential SAT solvers produce an independently-checkable 
proof that there exists no assignment that satisfies the formula. 
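
The linear-time check of a satisfying assignment can be sketched as follows. This is a minimal illustration, not part of any solver: literals are assumed to be encoded as non-zero signed integers, and the function name `satisfies` and the small example formula are hypothetical.

```python
def satisfies(clauses, assignment):
    """Return True if every clause contains at least one true literal.

    clauses: list of clauses, each a list of non-zero ints (signed literals).
    assignment: dict mapping variable number -> bool.
    """
    for clause in clauses:
        # A literal is true if its sign agrees with the variable's value.
        if not any(assignment[abs(lit)] == (lit > 0) for lit in clause):
            return False
    return True


# Hypothetical three-clause formula over variables 1, 2, 3.
formula = [[1, -2, 3], [-1, 2], [2, 3]]
good = satisfies(formula, {1: True, 2: True, 3: False})
bad = satisfies(formula, {1: True, 2: False, 3: True})
print(good, bad)
```

Each literal is inspected at most once, so the check is linear in the size of the formula. No comparably simple check exists for unsatisfiability, which is why proofs are needed.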


File Formats in Practical SAT Solving. In practical SAT solving, formulas are 
specified in the DIMACS format. DIMACS files feature a header of the form 
‘p cnf #variables #clauses’ followed by a list of clauses, one clause per line. 
For example, the clause (x₁ ∨ ¬x₂ ∨ x₃) is represented as ‘1 -2 3 0’. An example
formula in DIMACS format is given in Figure 1. 

The current standard format for proofs is DRAT [15]. DRAT files are similar 
to DIMACS files, with each line containing a proof statement that is either an 
addition or a deletion. Additions are lines that represent clauses like in the DI- 
MACS format; they identify clauses that were derived (“learned”) by the solver. 
Each clause addition must preserve satisfiability by adhering to the so-called 




Fig. 1: DIMACS formula and corresponding proofs in DRAT and LRAT format. 


RAT criterion—as the details of RAT are not essential to our paper, we refer 
the reader to the respective literature for more details [20]. Deletions are lines 
that start with a ‘d’, followed by a clause; they identify clauses that were deleted 
by the solver because they were not deemed necessary anymore. Clause deletions 
can only make a formula “more satisfiable”, meaning that they aren’t required 
for deriving unsatisfiability, but they drastically speed up proof checking. A valid 
DRAT proof of unsatisfiability ends with the derivation of the empty clause. As 
the empty clause is trivially unsatisfiable (and since each proof step preserves 
satisfiability) the unsatisfiability of the original formula can then be concluded. 
An example DRAT proof is given in Figure 1. 

The more recent LRAT proof format [9] augments each clause-addition step 
with so-called hints, which identify the clauses that were required to derive the 
current clause. This makes proof checking more efficient, and in fact the usual 
pipeline for trusted proof checking is to first use an efficient but unverified tool 
(like DRAT-trim [15]) to transform a DRAT proof into an LRAT proof, and 
then check the resulting LRAT proof with a formally verified proof checker (cf.
[9, 13, 22, 30]). Figure 1 shows an LRAT proof corresponding to a DRAT proof.
Each proof line starts with a clause ID. The numbering starts with 9 because 
the eight clauses of the original formula are assigned the IDs 1 to 8. Each clause 
addition first lists the literals of the clause, then a terminating 0, followed by 
hints (in the form of clause IDs), and finally another 0. For example, clause 
9 contains the literal -3 and can be derived from the clauses 4 and 5 of the 
original formula. Clause deletions just state the clause ID of the clause that is 
to be deleted, as in the later deletion of clause 9. In our work, we exploit the 
hints of LRAT to determine dependencies among distributed solvers. 
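
To make the line layout concrete, the following sketch splits a textual LRAT line into its parts. It is an illustration under assumptions: the function name `parse_lrat_line` is hypothetical, the deletion syntax `<id> d <ids> 0` follows the textual LRAT convention as we understand it, and the example lines are constructed from the description above rather than copied from Figure 1.

```python
def parse_lrat_line(line):
    """Split one textual LRAT line into (clause_id, kind, literals, hints).

    Assumed layouts:
      addition: '<id> <literal>* 0 <hint>* 0'
      deletion: '<id> d <deleted-id>* 0'
    For deletions, the last component lists the deleted clause IDs.
    """
    tokens = line.split()
    clause_id = int(tokens[0])
    if tokens[1] == "d":
        deleted = [int(t) for t in tokens[2:-1]]
        return clause_id, "delete", [], deleted
    numbers = [int(t) for t in tokens[1:]]
    end_of_clause = numbers.index(0)          # first 0 terminates the literals
    literals = numbers[:end_of_clause]
    hints = numbers[end_of_clause + 1:-1]     # drop the final terminating 0
    return clause_id, "add", literals, hints


# Clause 9 with literal -3, derived from original clauses 4 and 5.
print(parse_lrat_line("9 -3 0 4 5 0"))
# An empty clause has no literals, only hints.
print(parse_lrat_line("14 0 11 10 1 0"))
```

The hints are exactly the dependency information that the proof combination relies on.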


Parallel and Distributed SAT Solving. One way to parallelize SAT solving is to 
run a portfolio of sequential solvers in parallel and to consider a problem solved 
as soon as one of the solvers finishes (cf. [1, 4, 5, 11, 12, 18, 23, 29, 32]). Given
that the solvers are sufficiently diverse, portfolio solving is already effective if 
all of the sequential solvers work independently, but performance and scalability 
can be boosted significantly by having the solvers share information in the form 
of learned clauses [4,12]. This approach is taken by the distributed solver Mal- 
lob [28], which won the cloud track of the last three SAT competitions [2,3,27]. 
As opposed to other solvers, Mallob relies on a communication-efficient aggregation
strategy to collect the globally most useful learned clauses and to reliably
filter duplicates as well as previously shared clauses [27]. With this strategy, 
which aims to maximize the density and utility of the communicated data, Mal- 
lob scored first place in all four eligible subtracks for unsatisfiable problems at 
the 2022 SAT Competition. 

As we discuss in more detail later, the drawback of clause sharing is that a 
local proof written by an individual solver may contain clauses whose deriva- 
tions cannot be justified because they rely on clauses imported from another 
solver. Previous work focuses on writing DRAT proofs for clause-sharing par- 
allel solvers [14]. In that work, solvers write to the same shared proof as they 
learn clauses. However, since the clauses are shared, one solver deleting a clause 
could invalidate a later clause-addition by another solver that is still holding the 
clause. To handle this, the parallel solver moderates deletion statements, only 
writing them to the proof once all solvers have deleted a clause, which leads to 
poor scalability during proof search. In our approach, solvers write proof files 
fully independently—only when the unsatisfiability of the problem has been de- 
termined do we combine all proofs into a single valid proof. 

Other recent work includes reconstructing proofs from divide-and-conquer 
solvers [24] and from a particular shared-memory parallel solver [10] whereas we 
aim to exploit distributed portfolio solving. 


3 Basic Proof Production 


Our goal is to produce checkable unsatisfiability proofs for problems solved by 
distributed clause-sharing SAT solvers. We propose to reuse the work done on 
proofs for sequential solvers by having each solver produce a partial proof con- 
taining the clauses it learned. These partial proofs are invalid in general because 
each sequential solver can rely on clauses shared by other solvers when learning 
new clauses. For example, when solver A derives a new clause, it might rely on 
clauses from solvers B and C, which in turn relied on clauses from solvers D 
and E, and so on. The justification of A’s clause derivation is thus spread across 
multiple partial proofs. We need to combine the partial proofs into a single valid 
proof in which the clauses are in dependency order, meaning that each clause 
can be derived from previous clauses. 

To generate an efficiently-checkable combined proof in a scalable way, we 
must solve three challenges: 

1. Provide metadata to identify which solver produced each learned clause. 

2. Efficiently sort learned clauses in dependency order across all solvers. 

3. Reduce proof size by removing unnecessary clauses. 

Switching from DRAT to the LRAT proof format provides the mechanism to 
unlock all three challenges. First, we specialize the clause-numbering scheme used 
by LRAT in order to distinguish the clauses produced by each solver. Second, 
we use the dependency information from LRAT to construct a complete proof 
from the partial proofs produced by each solver. Finally, we determine which 
clauses are unnecessary (or used only for certain parts of the proof) to delete 
clauses from the proof as soon as they are no longer required. 
Algorithm 1 Algorithm for combining partial proofs

 1: function COMBINE(partial proofs p_1, p_2, ..., p_n, number of original clauses o)
 2:     i ← 1
 3:     while true do
 4:         if p_i.hasNext() then
 5:             (id, type, clause, proofHint) ← p_i.peekNext()
 6:             if dependenciesSatisfied(proofHint) then
 7:                 emit (id, type, clause, proofHint)
 8:                 p_i.next()                  ▷ Line completed
 9:                 if clause = ∅ then          ▷ Derived empty clause
10:                     return
11:             else                            ▷ Leave the line and move to next partial proof
12:                 i ← (i mod n) + 1
13:         else                                ▷ Move to next partial proof if current is done
14:             i ← (i mod n) + 1

We update the clause-distribution mechanism in the distributed solver to 
broadcast the clause ID with each learned clause. A receiving solver stores the 
clause with its ID and uses the ID in proof hints when the clause is used locally, 
as it does with locally-derived clauses. Unlike locally-derived clauses, we add no 
derivation lines for remote clauses to the local proof. Instead, these derivations 
will be added to the final proof when combining the partial proofs. 


3.1 Solver Partial Proof Production 


To combine the partial proofs into a complete proof, we modify the mechanism 
producing LRAT proofs in each of the component solvers. We assign to each 
clause an ID that is unique across solvers and identifies which solver originally 
derived it. The following mapping from clauses to IDs achieves this: 


Definition 1. Let o be the number of clauses in the original formula and let
n be the number of sequential solvers. Then, the ID of the k-th derived clause
(k ≥ 0) of solver i is defined as ID_k^i = o + i + nk.


Given ID_k^i, we can easily determine the solver ID i using modular arithmetic.
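
The mapping and its inverse can be sketched as follows. The sketch assumes that k is counted from 0 and solvers from 1, so that for o = 8 and n = 2 the first solver produces the IDs 9, 11, 13 and the second 10, 12, 14; the function names are hypothetical.

```python
def clause_id(o, n, i, k):
    """ID of the k-th derived clause (k >= 0) of solver i (1 <= i <= n)."""
    return o + i + n * k


def solver_of(o, n, cid):
    """Recover the 1-based index of the solver that derived clause cid."""
    if cid <= o:
        raise ValueError("IDs 1..o belong to original clauses")
    return (cid - o - 1) % n + 1


def local_index(o, n, cid):
    """Recover k, the position of clause cid among its solver's clauses."""
    return (cid - o - 1) // n


# Two solvers working on a formula with eight original clauses.
ids_solver_1 = [clause_id(8, 2, 1, k) for k in range(3)]
ids_solver_2 = [clause_id(8, 2, 2, k) for k in range(3)]
print(ids_solver_1, ids_solver_2)
print(solver_of(8, 2, 13), local_index(8, 2, 13))
```

Because the ID ranges of different solvers interleave, every solver can keep producing fresh IDs without any coordination.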


3.2 Partial Proof Combination 


Once the distributed solver has concluded the input formula is unsatisfiable, we 
have n partial proofs. The clause derivations in these proofs refer to clauses of 
other partial proofs, but they are, locally, in dependency order. We can therefore 
combine the partial proofs without reordering their clauses beforehand. We can 
simply interleave their clauses so the resulting proof is also in dependency order, 
ignoring any deletions in the partial proofs. 

Our algorithm goes through the partial proofs round-robin, at each step 
emitting all the clauses from each file where the dependencies of the clause have 


Fig. 2: Partial proofs and combined proof of unsatisfiability. 


already been emitted. It ends when the empty clause is emitted. The procedure
is shown in Algorithm 1. For each partial proof, we maintain an iterator over the
learned clauses. If the dependencies of the next clause in the current partial
proof (p_i) are satisfied (determined by comparing each hint to the last clause
emitted from the partial proof whence it originated), the algorithm emits the
line and moves to the next clause in the same file; otherwise, it cycles to the
next partial proof. The algorithm terminates when it emits the empty clause (line 10).


Example 1. Suppose that two solver instances (instance 1 and instance 2) de- 
termined together that the formula from Figure 1 is unsatisfiable, with the two 
partial proofs shown in Figure 2. We start with instance 1. As clause 9 only relies 
on original clauses, we emit it. Clause 11 relies on original clause 6 and emitted 
clause 9, so we emit it. Clause 13 relies on clauses 8 and 12; clause 12 has not been emitted yet,
so we cannot emit clause 13 and move to instance 2. Clause 10 can be emitted, 
as can clause 12, which relies on an original and an emitted clause. Clause 14 
relies on emitted clauses 11 and 10 and on original clause 1, so we can emit it as 
well. Since clause 14 is the empty clause, we finish with a complete proof, shown 
in Figure 2(c). Notice that clause 13 was not added to the combined proof, since 
it was not required to satisfy any dependencies of the empty clause. 
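
The round-robin combination can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the paper: it assumes the ID scheme of Definition 1 with solvers counted from 1, it adds a stall guard that Algorithm 1 does not have, and the clause literals as well as the hints not mentioned in Example 1 are made up for the sake of a runnable example.

```python
from collections import deque


def combine(partial_proofs, o):
    """Interleave partial proofs in dependency order (deletions ignored).

    partial_proofs: one list per solver of (clause_id, literals, hints),
    each in local dependency order. Solver i (1-based) owns IDs o + i + n*k.
    """
    n = len(partial_proofs)
    queues = [deque(p) for p in partial_proofs]
    last_emitted = [0] * n      # highest clause ID emitted so far, per solver
    combined = []
    i, stalled = 0, 0           # stalled: consecutive proofs without progress
    while stalled < n:
        queue = queues[i]
        # A hint is satisfied if it is an original clause or if its solver
        # has already emitted a clause with that ID or a later one.
        if queue and all(
            abs(h) <= o or abs(h) <= last_emitted[(abs(h) - o - 1) % n]
            for h in queue[0][2]
        ):
            cid, literals, hints = queue.popleft()
            combined.append((cid, literals, hints))
            last_emitted[i] = cid
            stalled = 0
            if not literals:    # derived the empty clause
                return combined
        else:
            i = (i + 1) % n     # move to the next partial proof
            stalled += 1
    raise ValueError("no partial proof can make progress")


# Hint structure of Example 1; literals are placeholders.
instance_1 = [(9, [-3], [4, 5]), (11, [2], [6, 9]), (13, [1], [8, 12])]
instance_2 = [(10, [3], [2, 3]), (12, [-1], [7, 10]), (14, [], [11, 10, 1])]
combined = combine([instance_1, instance_2], o=8)
print([cid for cid, _, _ in combined])
```

On this input, the clauses are emitted in the order 9, 11, 10, 12, 14, and clause 13 never reaches the combined proof.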


3.3 Proof Pruning 


The combined proof produced by our procedure is valid but not efficiently check- 
able because (1) it can contain clauses that are not required to derive the empty 
clause and (2) it does not contain deletion lines, meaning that a proof checker 
must maintain all learned clauses in memory throughout the checking process. 
To reduce size and to improve proof-checking performance, we prune our com- 
bined proof toward a minimal proof containing only necessary clauses, and we 
add deletion statements for clauses as soon as they are not needed anymore. 
Algorithm 2 shows our pruning algorithm that walks the combined proof in 
reverse (similar to backward checking of DRAT proofs [19]). We maintain a set of 
clauses required in the proof, initialized to the empty clause alone. We then pro- 
cess all clauses in reverse order, including the empty clause, ignoring all clauses 
not in the required set. For each required clause, we check its dependencies to 
see if this is the first time (from the proof’s end) a dependency is seen; if so, 
we emit a deletion line for the dependency since it will never be used again in 
the proof. After checking all its dependencies, we output the clause itself. The 
Algorithm 2 Algorithm for pruning proofs

 1: function PRUNE(combined and reversed proof p, number of original clauses o)
 2:     required ← {p.peekNextId()}             ▷ Must be empty clause, which is required
 3:     while p.hasNext() do
 4:         (id, type, clause, proofHint) ← p.readNext()
 5:         if id ∈ required then               ▷ Only process a line if it is required later
 6:             for hint ∈ proofHint do
 7:                 if hint > o ∧ hint ∉ required then      ▷ Not used later
 8:                     required ← required ∪ {hint}
 9:                     emit (id, delete, hint)
10:             emit (id, add, clause, proofHint)



final output of the algorithm is a proof in reversed order, where each clause is 
required for some derivation and deleted as soon as it is no longer required. 


Example 2. Consider the combined proof from Figure 2. After applying Algo- 
rithm 2, working backward from clause 14, we determine that clause 12 is not 
required, so it is removed. Additionally, prior to clause 11, clause 9 is not in the 
required set, so it can be deleted after processing clause 11. On larger proofs, as 
discussed in Section 6, pruning can reduce the size of the proof by 10x or more. 
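
The backward pass can be sketched as follows. The sketch is an illustration under assumptions: deletion lines are simplified to a pair ("delete", id), the output is reversed at the end so that it reads in forward order, and the literals of the example proof are placeholders, while its hint structure follows Examples 1 and 2.

```python
def prune(combined, o):
    """Backward pass over a combined proof that ends with the empty clause.

    combined: list of (clause_id, literals, hints) in dependency order.
    Returns forward-ordered lines: ("add", id, literals, hints) or ("delete", id).
    """
    required = {combined[-1][0]}            # the empty clause is required
    reversed_out = []
    for cid, literals, hints in reversed(combined):
        if cid not in required:
            continue                        # never used later: drop the line
        for h in hints:
            if h > o and h not in required:
                required.add(h)
                # First sighting from the end is the last use of clause h.
                reversed_out.append(("delete", h))
        reversed_out.append(("add", cid, literals, hints))
    reversed_out.reverse()
    return reversed_out


combined = [
    (9, [-3], [4, 5]),
    (11, [2], [6, 9]),
    (10, [3], [2, 3]),
    (12, [-1], [7, 10]),
    (14, [], [11, 10, 1]),
]
pruned = prune(combined, o=8)
for line in pruned:
    print(line)
```

In the result, clause 12 has disappeared and clause 9 is deleted directly after clause 11 has been added, as described in Example 2.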


4 Distributed Proof Production 


The proof production as described above is sequential and may process huge 
amounts of data, all of which needs to be accessible from the machine that 
executes the procedure. In addition, maintaining the required clause IDs during 
the procedure may require a prohibitive amount of memory for large proofs. In 
the following, we propose an efficient distributed approach to proof production. 


4.1 Overview 


Our previous sequential proof-combination algorithm first combines all partial 
proofs into a single proof and then prunes unneeded proof lines. In contrast, 
our distributed algorithm first prunes all partial proofs in parallel and only then 
merges them into a single file. 

We have m processes with c solver instances each, amounting to a total of 
n = mc solvers. We make use of the fact that the solvers exchange clauses in 
periodic intervals (one second by default). We refer to these intervals between 
subsequent sharing operations as epochs. Consider Fig. 3 (left): Clause 118 was
produced by S_2 in epoch 1. Its derivation may depend on local clause 114 and on
any of the 11 clauses produced in epoch 0, but it cannot depend, e.g., on clause
109 or 111 since these clauses have been produced after the last clause sharing.
More generally, a clause c produced by instance i during epoch e can only depend
on (i) earlier clauses by instance i produced during epoch e or earlier, and (ii)
clauses by instances j ≠ i produced before epoch e.


Fig. 3: Four solvers work on a formula with 99 original clauses, produce new 
clauses (depicted by their ID), and share clauses periodically, without (left) and 
with (right) aligning clause IDs. 


Using this knowledge, we can essentially rewind the solving procedure. Each 
process reads its partial proofs in reverse order, outputs each line which adds a 
required clause, and adds the hints of each such clause to the required clauses. 
Required remote clauses produced in epoch e are transferred to their process of 
origin before any proof lines from epoch e are read. As such, whenever a process 
reads a proof line, it knows whether the clause is required. The outputs of all 
processes can be merged into a single valid proof (Section 4.3). 


4.2 Distributed Pruning 


Clause ID Alignment. To synchronize the reading and redistribution of clause 
IDs in our distributed pruning, we need a way to decide from which epoch a 
remote clause ID originates. However, solvers generally produce clauses with 
different speeds, so the IDs by different solvers will likely be in dissimilar ranges 
within the same epoch over time. For instance, in Fig. 3 (left) instance S_3 has no
way of knowing from which epoch clause 118 originates. To solve this issue, we
propose to align all produced clause IDs after each sharing. During the solving
procedure, we add a certain offset δ_i^e to each ID produced by instance i in epoch
e. As such, we can associate each epoch e with a global interval [A_e, A_{e+1}) that
contains all clause IDs produced in that epoch. In Fig. 3 (right), A_0 = 100,
A_1 = 116, and A_2 = 128. Clause 118 on the left has been aligned to 122 on the
right (δ_2^1 = 4) and due to A_1 ≤ 122 < A_2 all instances know that this clause
originates from epoch 1.

Initially, δ_i^0 := 0 for all i. Let I_i^e be the first original (unaligned) ID produced
by instance i in epoch e. With the sharing that initiates epoch e > 0, we compute
the common start of epoch e, A_e := max_i {I_i^e + δ_i^{e−1} − i}, as the lowest possible
value that is larger than all clause IDs from epoch e−1. We then compute offsets
δ_i^e in such a way that I_i^e + δ_i^e = A_e + i, which yields δ_i^e := (A_e + i) − I_i^e. If we then
export a clause produced during e by instance i, we add δ_i^e to its ID, and if we
import shared clauses to i, we filter any clauses produced by i itself. Note that
we do not modify the solvers’ internal ID counters or the proofs they output.
Later, when reading the partial proof of solver i at epoch e, we need to add δ_i^e
to each ID originating from i. All other clause IDs are already aligned.
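
The offset computation can be sketched as follows. The sketch rests on assumptions: instances are indexed from 0 here so that the values quoted for Fig. 3 are reproduced, and the first unaligned IDs of epoch 1 (116, 109, 114, 111) are assumed values chosen to be consistent with that example. The function names are hypothetical.

```python
from bisect import bisect_right


def align_epoch(first_ids, prev_offsets):
    """Compute the epoch start A_e and the offsets delta_i^e for one epoch.

    first_ids[i]:    I_i^e, first unaligned ID instance i produces in epoch e
    prev_offsets[i]: delta_i^{e-1}, offset of instance i in the previous epoch
    """
    start = max(first_ids[i] + prev_offsets[i] - i for i in first_ids)
    offsets = {i: (start + i) - first_ids[i] for i in first_ids}
    return start, offsets


def epoch_of(aligned_id, epoch_starts):
    """Look up the epoch whose interval [A_e, A_{e+1}) contains aligned_id."""
    return bisect_right(epoch_starts, aligned_id) - 1


# Assumed first IDs of epoch 1 for four instances and 99 original clauses.
first_ids = {0: 116, 1: 109, 2: 114, 3: 111}
prev_offsets = {i: 0 for i in first_ids}      # all offsets are 0 in epoch 0
start, offsets = align_epoch(first_ids, prev_offsets)
aligned = 118 + offsets[2]                    # clause 118 stems from instance 2
print(start, offsets, aligned, epoch_of(aligned, [100, 116, 128]))
```

After alignment, the first ID of instance i in epoch e is A_e + i, so the epoch of any shared clause follows from a lookup in the list of epoch starts.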
Rewinding the Solve Procedure. Assume that instance u ∈ {1,...,n} has derived
the empty clause in epoch ê. For each local solver i, each process has a frontier
F_i of required clauses produced by i. In addition, each process has a backlog B
of remote required clauses. B and F_i are collections of clause IDs and can be
thought of as maximum-first priority queues. Initially, F_u contains the ID of the
empty clause while all other frontiers and backlogs are empty. Iteration x ≥ 0 of
our algorithm processes epoch ê − x and features two stages:

1. Processing: Each process continues to read its partial proofs in reverse
order from the last introduced clause of the current epoch. If a line from solver
i is read whose clause ID is at the top of F_i, then the ID is removed from F_i,
the line is output, and each clause ID hint h in the line is treated as follows:

– h is inserted in F_j if local solver j (possibly j = i) produced h.

– h is inserted in B if a remote solver produced h.

– h is dropped if h is an ID of an original clause of the problem.
Reading stops as soon as a line’s ID precedes epoch e = ê − x. Each F_i as well
as B now only contain clauses produced before e.

2. Task redistribution: Each process extracts all clause IDs from B that were
produced during ê − x − 1. These clause IDs are aggregated among all processes,
eliminating duplicates in the same manner as Mallob’s clause sharing detects
duplicate clauses [28]. Each process traverses the aggregated clause IDs, and
each clause produced by a local solver i is added to F_i.

Our algorithm stops in iteration ê after the Processing stage, at which point
all frontiers and backlogs are empty and all relevant proof lines have been output.
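
The Processing stage can be sketched for the simplified case of one solver per process. This is a hedged illustration, not the implementation in Mallob: the function `process_epoch`, the callback `is_local`, the epoch boundaries, and the example data are all hypothetical, and the max-first queues are realized as heaps of negated IDs.

```python
import heapq


def process_epoch(lines, pos, frontier, backlog, epoch_start, o, is_local):
    """Read one epoch of a reversed partial proof and output required lines.

    lines: list of (clause_id, literals, hints), highest aligned ID first
    pos: index of the next unread line
    frontier, backlog: max-first queues, kept as heaps of negated clause IDs
    epoch_start: first aligned ID of the epoch being processed
    is_local(h): True if this solver produced clause h
    Returns the lines to output and the new read position.
    """
    output = []
    while pos < len(lines) and lines[pos][0] >= epoch_start:
        cid, literals, hints = lines[pos]
        pos += 1
        if not frontier or -frontier[0] != cid:
            continue                            # clause is not required
        while frontier and -frontier[0] == cid:
            heapq.heappop(frontier)             # remove the ID and duplicates
        output.append((cid, literals, hints))
        for h in hints:
            if h <= o:
                continue                        # original clause: dropped
            heapq.heappush(frontier if is_local(h) else backlog, -h)
    return output, pos


# Hypothetical data: o = 8, two solvers, this process hosts solver 2.
# Epoch 0 is assumed to cover IDs 9..12, and epoch 1 starts at ID 13.
o = 8
lines = [(14, [], [11, 10, 1]), (12, [-1], [7, 10]), (10, [3], [2, 3])]
frontier, backlog = [-14], []
is_local = lambda h: (h - o - 1) % 2 + 1 == 2

out1, pos = process_epoch(lines, 0, frontier, backlog, 13, o, is_local)
backlog_after_epoch_1 = list(backlog)           # clause 11 awaits redistribution
out0, pos = process_epoch(lines, pos, frontier, backlog, 9, o, is_local)
print([l[0] for l in out1], backlog_after_epoch_1, [l[0] for l in out0])
```

Between the two calls, the Task redistribution stage would send the backlog entry for clause 11 to the process hosting solver 1, which inserts it into its own frontier.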


Analysis. In terms of total work performed, all partial proofs are read completely. 
For each required clause we may perform an insertion into some B, a deletion 
from said B, an insertion into some F;, and a deletion from said F;. If we assume 
logarithmic work for each insertion and deletion, the work for these operations 
is linear in the combined size of all partial proofs and loglinear in the size of the 
output proof. In addition, we have ê iterations of communication whose overall 
volume is bounded by the communication done during solving. In fact, since only 
a subset of shared clauses are required and we only share 64 bits per clause, we 
expect strictly less communication than during solving. Computing A_e for each
epoch e during solving is negligible since the necessary aggregation and broadcast 
can be integrated into an existing collective operation. Regarding memory usage, 
the size of each B and each F; can be proportional to the combined size of 
all required lines of the according partial proofs. However, we can make use of 
external data structures which keep their content on disk except for a few buffers. 


4.3 Merging Step 


For each partial proof processed during the pruning step, we have a stream of 
proof lines sorted in reverse chronological order, i.e., starting with the highest 
clause ID. The remaining task is to merge all these lines into a single, sorted 
proof file. As shown in Fig. 4 (left), we arrange all processes in a tree. We can 
easily merge a number of sorted input streams into a single sorted output stream 
Fig. 4: Left: Proof merging with seven processes and 14 solvers. Each box represents
a process with two local proof sources. Dashed arrows denote communication.
Right: Example of merging three streams of LRAT lines into a single
stream. Each number i represents an LRAT line describing a clause of ID i.


by repeatedly outputting the line with the highest ID among all inputs (Fig. 4 
right). This way, we can hierarchically merge all streams along the tree. At the 
tree’s root, the output stream is directed into a file. This is a sequential I/O task 
that limits the speed of merging. Finally, since the produced file is in reverse 
order, a buffered operation reverses the file’s content. 
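
Merging several streams that are sorted by descending clause ID can be sketched with Python's standard `heapq.merge`. The sketch covers a single merge node, not the tree of processes; the function name and the example lines (including their literals) are hypothetical.

```python
import heapq


def merge_reverse_sorted(*streams):
    """Merge streams of (clause_id, line) pairs sorted by descending ID.

    The result is a lazy iterator that always yields the pair with the
    highest clause ID among the heads of all input streams.
    """
    return heapq.merge(*streams, key=lambda item: item[0], reverse=True)


# Pruned, reversed outputs of two solvers (placeholder literals).
stream_1 = [(11, "11 2 0 6 9 0"), (9, "9 -3 0 4 5 0")]
stream_2 = [(14, "14 0 11 10 1 0"), (10, "10 3 0 2 3 0")]

merged = list(merge_reverse_sorted(stream_1, stream_2))
forward = merged[::-1]          # final reversal into chronological order
print([cid for cid, _ in merged])
print([cid for cid, _ in forward])
```

Since the merge is lazy, an inner node of the tree can forward lines to its parent as soon as the heads of all its input streams are known.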

A final challenge is to add clause deletions to the final proof. Before a line is 
written to the combined proof file, we can scan its hints and output a deletion 
line for each hint we did not encounter before (see Section 3.3). However, implementing
this in an exact manner requires maintaining a set of clause IDs which
scales with the final proof size. Since our proof remains valid even if we omit 
some clause deletions, we can use an approximate membership query (AMQ) 
structure with fixed size and a small false positive rate, e.g., a Bloom filter [7]. 
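
A fixed-size Bloom filter for this purpose can be sketched as follows. This is an illustrative sketch and not the filter used in our implementation: the class layout, the choice of `hashlib.blake2b` with a per-hash salt, and the sizes are assumptions. A false positive only causes a deletion line to be omitted, which keeps the proof valid.

```python
import hashlib


class BloomFilter:
    """Fixed-size approximate set of clause IDs (false positives possible)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, clause_id):
        data = clause_id.to_bytes(8, "little")      # 64-bit clause ID
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, clause_id):
        """Insert clause_id; return True if it was definitely not present."""
        new = False
        for position in self._positions(clause_id):
            byte, bit = divmod(position, 8)
            if not self.bits[byte] & (1 << bit):
                new = True
                self.bits[byte] |= 1 << bit
        return new


def deletions_for(hints, o, seen):
    """Hints of a reversed proof line that are seen for the first time."""
    return [h for h in hints if h > o and seen.add(h)]


seen = BloomFilter(num_bits=1 << 16, num_hashes=3)
first = deletions_for([11], 8, seen)     # last use of clause 11: delete it
second = deletions_for([11], 8, seen)    # already seen: no deletion line
print(first, second)
```

The memory footprint is fixed by `num_bits` regardless of the proof size, at the price of occasionally keeping a clause in the checker longer than necessary.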


5 Implementation 


We employ a solver portfolio based on the sequential SAT solver CaDiCaL [5]. 
We modified CaDiCaL to output LRAT proof lines and to assign clause IDs as 
described in Section 3.1. To ensure sound LRAT proof logging, some features of 
CaDiCaL currently need to be turned off, such as bounded variable elimination, 
hyper-ternary resolution, and vivification. Similarly, Mallob’s original portfolio 
of CaDiCaL configurations features several options that are incompatible with 
our proof logging as of yet. Therefore, we created a smaller portfolio of “safe” 
configurations that include shuffling variable priorities, adjusted restart intervals, 
and disabled inprocessing. We also use different random seeds and use Mallob’s 
diversification based on randomized initial variable polarities. 

We modified Mallob to associate each clause with a 64-bit clause ID. For 
consistent bookkeeping of sharing epochs, we defer clause sharing until all pro- 
cesses have fully initialized their solvers. While several solvers may derive the 
empty clause simultaneously, only one of them is selected to be the “winner” 
whose empty clause will be traced. The distributed proof production features 
communication similar to Mallob’s clause sharing. To realize the frontier F_i and
the backlog B described in Section 4.2, we implemented an external-memory 
data structure which writes clause IDs to disk, categorized by their epoch. Upon 
reaching a new epoch, all clause IDs from this epoch are read from disk and in- 
serted into an internal priority queue to allow for efficient polling and insertion. 
To merge the pruned partial proofs, we use point-to-point messages to query and 
send buffers of proof lines between processes. We interleave this merging with 
the pruning procedure in order to avoid writing the intermediate output to disk. 
We use a fixed-size Bloom filter to add some deletion lines to the final proof. 
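
The epoch-bucketed external-memory structure described above can be sketched as follows. This is a minimal single-process illustration: the class and method names, the one-file-per-epoch layout, and the choice to poll the largest clause ID first are all assumptions of this sketch rather than details confirmed by the text.

```python
import heapq
import os
import tempfile


class EpochBacklog:
    """Clause IDs bucketed by epoch on disk, plus an in-memory priority queue."""

    def __init__(self, directory: str):
        self.directory = directory
        self.heap = []  # max-heap emulated by storing negated IDs

    def _path(self, epoch: int) -> str:
        return os.path.join(self.directory, f"epoch-{epoch}.ids")

    def defer(self, epoch: int, clause_id: int) -> None:
        # Append the ID to the file of its epoch instead of keeping it in memory.
        with open(self._path(epoch), "a") as f:
            f.write(f"{clause_id}\n")

    def enter_epoch(self, epoch: int) -> None:
        # Load every ID stored for this epoch into the priority queue.
        path = self._path(epoch)
        if not os.path.exists(path):
            return
        with open(path) as f:
            for line in f:
                heapq.heappush(self.heap, -int(line))
        os.remove(path)

    def poll(self):
        # Return the largest pending clause ID, or None if the queue is empty.
        return -heapq.heappop(self.heap) if self.heap else None


with tempfile.TemporaryDirectory() as tmp:
    backlog = EpochBacklog(tmp)
    backlog.defer(2, 205)
    backlog.defer(1, 101)
    backlog.enter_epoch(2)
    largest = backlog.poll()  # 205; the ID of epoch 1 is still on disk
```

Only the IDs of the epoch currently being processed occupy main memory, which is the point of bucketing by epoch.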


6 Evaluation 


In this section, we present an evaluation of our proof production approaches. We 
provide the associated software as well as a digital appendix online.⁷


6.1 Experimental Setup 


Supporting proofs introduces several kinds of performance overhead for clause- 
sharing portfolios in terms of solving, proof reconstruction, and proof checking. 
We wish to examine how well our proof-producing solver performs against (1) 
best-of-breed parallel and cloud solvers that do not produce proofs, (2) previous 
approaches to proof-producing parallel solvers, and (3) best-of-breed sequential 
solvers. We analyze the overhead introduced by each phase of the process, and 
we discuss how and where future efforts might improve performance. 

We use the following pipeline for our proof-producing solvers: First, the in- 
put formula is preprocessed by performing exhaustive unit propagation. This is 
necessary due to a technical limitation of our LRAT-producing modification of 
CaDiCaL. Second, we execute our proof-producing variant of Mallob on the pre- 
processed formula. Third, we prune and combine all partial proofs, using either 
our sequential proof production or our distributed proof production. Fourth, we 
merge the preprocessor’s proof and our produced proof and syntactically trans- 
form the result to bring the set of clause IDs into compact shape. Fifth and 
finally, we run lrat-check⁸ to check the final proof. Only steps two and three
of our pipeline are parallelized (step three depending on the particular experi- 
ment). We will refer to the first two steps as solving, the third step as assembly, 
the fourth step as postprocessing, and the fifth step as checking. 

To examine performance overhead for proof-producing parallel and dis- 
tributed solvers, we compare our proof-producing cloud and parallel solvers 
(mallob-cacld-p and mallob-capar-p) against six solvers. First, we include 
the winners of the 2022 SAT competition cloud track (mallob-kicaliglu, us- 
ing Kissat+CaDiCaL+Lingeling+Glucose), parallel track (parkissat-rs, using 
Kissat), and sequential track (Kissat_MAB-HyWalk), as well as the second place 


7 https://github.com/domschrei/mallob/tree/certified-unsat
8 https://github.com/marijnheule/drat-trim
Table 1: Overview of solved instances: (S)equential, (P)arallel, and (C)loud


Solver                | Type | Solved | SAT | UNSAT | PAR-2 score
Kissat_MAB-HyWalk     | S    |    218 | 118 |   100 |      1065.7
parkissat-rs          | P    |    299 | 155 |   144 |       603.0
mallob-ki             | P    |    260 | 113 |   147 |       827.6
mallob-capar          | P    |    292 | 145 |   147 |       641.6
mallob-capar-p (Seq.) | P    |    279 | 140 |   139 |       719.8
mallob-capar-p (Par.) | P    |    276 | 141 |   135 |       731.4
mallob-kicaliglu      | C    |    341 | 165 |   176 |       344.8
mallob-cacld          | C    |    333 | 163 |   170 |       378.0
mallob-cacld-p        | C    |    314 | 159 |   155 |       484.1


solver from the parallel track (mallob-ki, using Lingeling⁹). We then run a
parallel and cloud version of Mallob that runs our described CaDiCaL portfolio 
without proof production (mallob-capar and mallob-cacld). 

Following the SAT competition setup, each cloud solver runs on 100 
m6i.4xlarge EC2 instances (16 core, 64GB RAM), each parallel solver runs on 
a single m6i.16xlarge EC2 instance (64 core, 256GB RAM), and the sequential 
Kissat_MAB-HyWalk runs on a single m6i.4xlarge EC2 instance. For each solver, 
we run the full benchmark suite from the SAT-Competition 2022 (400 formulas) 
containing both SAT and UNSAT examples. The timeout for the solving step is 
1000 seconds, and the timeout for all subsequent steps is set to 4000 seconds. 

Since earlier work [14] is no longer competitive in terms of solving time, 
we only compare proof-checking times. Specifically, we measure the overhead of 
checking un-pruned DRAT proofs such as the ones produced by [14]. As such, we
can get a picture of the performance of the earlier approach if it were realized
with state-of-the-art solving techniques. We generate un-pruned DRAT proofs 
from the original (un-pruned) LRAT proof by stripping out the dependency 
information and adding delete lines for the last use of each clause. 
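
The conversion just described can be sketched in Python as follows. The function name and the in-memory data layout (a dictionary for the input formula and a list of `(id, literals, hints)` triples for the LRAT lines) are assumptions of this sketch; a real tool would stream the textual LRAT file instead.

```python
def lrat_to_drat(formula, lrat_lines):
    """Strip LRAT hints and emit DRAT deletions at each clause's last use.

    formula:    {clause_id: [literals]} for the clauses of the input CNF
    lrat_lines: list of (clause_id, [literals], [hint ids]) in proof order
    """
    literals = dict(formula)
    last_use = {}
    for index, (cid, lits, hints) in enumerate(lrat_lines):
        literals[cid] = lits
        for hint in hints:
            # abs() because RAT hints may carry negative IDs in LRAT.
            last_use[abs(hint)] = index

    # Group clauses by the proof line after which they are no longer needed.
    delete_after = {}
    for cid, index in last_use.items():
        delete_after.setdefault(index, []).append(cid)

    drat = []
    for index, (cid, lits, _hints) in enumerate(lrat_lines):
        drat.append(" ".join(str(x) for x in list(lits) + [0]))
        for dead in sorted(delete_after.get(index, [])):
            drat.append("d " + " ".join(str(x) for x in list(literals[dead]) + [0]))
    return drat
```

Since DRAT identifies clauses by their literals rather than by IDs, the sketch has to keep a map from IDs to literals, which is the information a DRAT checker must otherwise recover by itself.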


6.2 Results 


First we examine the performance overhead of changing portfolios to enable proof 
generation as described in Section 5 on the solving process only. Fig. 5 (left) and 
Table 1 show this data. The PAR-2 metric takes the average time to solve each 
problem, but counts a timeout result as a 2x penalty (e.g., given our timeout of 
1000 seconds, a timeout is scored as taking 2000 seconds). We can see that our 
CaDiCaL portfolio mallob-capar outperforms the Lingeling-based mallob-ki 
significantly and is almost on par with parkissat-rs. Similarly, mallob-cacld 
solves eight fewer instances than mallob-kicaliglu but performs almost
equally well otherwise. In both cases, we have constructed solvers which are, 


9 mallob-ki employed a Lingeling-based portfolio due to a misconfiguration, see:
http://algo2.iti.kit.edu/schreiber/downloads/mallob-ki-mallob-1i.pdf 
Fig. 5: Left: Comparison of solving times. Right: Relation of solving times to
assembly and postprocessing times for mallob-cacld-p. Each pair of points 
corresponds to one instance, the y coordinate denoting the solving time. The 
left x coordinate denotes solving and assembly time and the right x coordinate 
denotes solving, assembly, and postprocessing time. 


up to a small margin, on par with the state of the art. For our actual proof- 
producing solvers, mallob-capar-p and mallob-cacld-p, we noticed a more 
pronounced decline in solving performance. On top of the overhead introduced 
by proof logging and our preprocessing, we experienced a few technical problems, 
including memory issues¹⁰, which resulted in a drop in the number of instances
solved and also caused mallob-capar-p with parallel proof production to solve
three fewer instances than with sequential proof production. We believe that we
can overcome these issues in future versions of our system. That being said, our 
proof-producing solvers already outperform any of the solvers at a lower scale. 
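
The PAR-2 metric used in Table 1 can be computed as in the following Python sketch. The function name and the convention of encoding an unsolved instance as `None` are assumptions of this example.

```python
def par2_score(runtimes, timeout=1000.0):
    """Average runtime where each unsolved instance counts as 2 * timeout.

    runtimes: one entry per instance; None (or a value >= timeout) means
              that the instance was not solved within the time limit.
    """
    penalized = [
        t if t is not None and t < timeout else 2 * timeout
        for t in runtimes
    ]
    return sum(penalized) / len(penalized)


# Two solved instances and two timeouts: (100 + 500 + 2000 + 2000) / 4
score = par2_score([100.0, 500.0, None, 1200.0])
```

Lower scores are better, so a solver is rewarded both for solving more instances and for solving them faster.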

Second, we examine statistics on proof reconstruction and checking, show- 
ing results in Table 2. Since we want to investigate our approaches’ overhead 
compared to pure solving, we measure run times as a multiple of the solving 
time. (We provide absolute run times in the Appendix, Table 1.) The prefix 
“Seq.” denotes mallob-capar-p with sequential proof production, “Par.” denotes 
mallob-capar-p with distributed proof production run on a single machine, and 
“Cld.” denotes mallob-cacld-p with distributed proof production. 

DRAT checking succeeded in 81 out of 139 cases and timed out in 58 cases. 
For the successful cases, DRAT checking took 24.8x the solving time on av- 
erage whereas our sequential assembly, postprocessing and checking combined 
succeeded in 139 cases and only took 3.8x the solving time on average. This 
result confirms that our approach successfully overcomes the major scalability 
problems of earlier work [14]. In terms of uncompressed proof sizes, our LRAT 


10 We disabled Mallob’s memory panic mode to ensure consistent proof logging. 
Table 2: Statistics on proof production and checking. All properties except for
file sizes and pruning factor are given as a multiple of the solving time. We list
minima, maxima, medians, arithmetic means, and the 10th and 90th percentiles.


Property              |   # |   min |   p10 |    med |    mean |     p90 |       max
DRAT check            |  81 | 0.512 | 1.725 |  7.442 |  24.815 |  67.065 |   169.869
Seq. assembly         | 139 | 0.019 | 0.305 |  1.376 |   2.324 |   5.747 |    13.289
Seq. postprocessing   | 139 | 0.001 | 0.012 |  0.131 |   0.263 |   0.790 |     2.218
Seq. checking         | 139 | 0.007 | 0.043 |  0.572 |   1.252 |   3.970 |    10.980
Seq. asm+post+chk     | 139 | 0.037 | 0.412 |  2.110 |   3.840 |  10.834 |    26.487
Par. assembly         | 135 | 0.059 | 0.080 |  0.365 |   0.805 |   2.227 |     7.475
Par. postprocessing   | 135 | 0.001 | 0.016 |  0.156 |   0.293 |   0.861 |     2.300
Par. checking         | 135 | 0.007 | 0.042 |  0.622 |   1.241 |   3.540 |    11.645
Par. asm+post+chk     | 135 | 0.067 | 0.167 |  1.097 |   2.339 |   6.611 |    21.420
Cld. assembly         | 155 | 0.114 | 0.185 |  1.412 |   2.444 |   5.410 |    44.268
Cld. postprocessing   | 155 | 0.003 | 0.060 |  0.696 |   2.046 |   4.785 |    39.096
Cld. checking         | 155 | 0.033 | 0.189 |  3.291 |   8.883 |  21.974 |   170.378
Cld. asm+post+chk     | 155 | 0.168 | 0.577 |  5.110 |  13.373 |  32.484 |   253.742
DRAT proof size (GiB) | 139 | 0.012 | 0.366 |  1.236 |   3.246 |   8.395 |    29.308
Seq. proof size (GiB) | 139 | 0.016 | 0.223 |  2.379 |   5.384 |  16.082 |    46.986
Par. proof size (GiB) | 135 | 0.006 | 0.173 |  2.034 |   5.345 |  13.164 |    57.739
Cld. proof size (GiB) | 155 | 0.016 | 0.342 |  3.940 |  10.533 |  30.130 |    89.106
Cld. pruning factor   | 155 | 2.374 | 5.379 | 17.826 | 293.762 | 337.486 | 12466.700


proofs can be about twice as large as the DRAT proofs, which seems more 
than acceptable considering the dramatic difference in performance. Given that 
DRAT-based checking was ineffective at the scale of parallel solvers, we decided 
to omit it in our distributed experiments which feature even larger proofs. 

Regarding mallob-capar-p with parallel proof production, we can see that 
the assembly time is reduced from 2.32x down to 0.81x the solving time on 
average, which also improves overall performance (3.84x to 2.34x). 

The results for mallob-cacld-p demonstrate that our proof assembly is feasi- 
ble, taking around 2.5x the solving time on average. We visualized this overhead 
and how it relates to the postprocessing overhead in Fig. 5 (right). The proofs 
produced are about twice as large as for mallob-capar-p. Considering that the 
proofs originate from 25 times as many solvers, this increase in size is quite mod- 
est, which can be explained by our proof pruning. We captured the pruning factor 
— the number of clauses in all partial proofs divided by the number of clauses in 
the combined proof — for each instance. Our pruning reduces the derived clauses 
by a factor of 293.8 on average (17.8 for the median instance), showing that it is 
a crucial technique to obtain proofs that are feasible to check. As such, we also 
managed to produce and check a proof of unsatisfiability for a formula whose 
unsatisfiability has not been verified before (PancakeVsInsertSort_8_7.cnf). 
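
To make the pruning factor concrete, the following Python sketch prunes a single combined proof by walking backward from the empty clause along the LRAT dependency hints. It is a sequential simplification: the function names and the dictionary layout are assumptions of this sketch, and the distributed procedure of the paper, which works across processes and epochs, is not reproduced here.

```python
def prune(proof, empty_clause_id):
    """Return the derived clause IDs that the empty clause depends on.

    proof: {clause_id: [hint ids]} for derived clauses only; hint IDs that
           are missing from the dictionary refer to original input clauses.
    """
    needed = set()
    stack = [empty_clause_id]
    while stack:
        cid = stack.pop()
        if cid in needed or cid not in proof:
            continue
        needed.add(cid)
        stack.extend(proof[cid])
    return needed


def pruning_factor(proof, empty_clause_id):
    # Clauses before pruning divided by clauses after pruning.
    return len(proof) / len(prune(proof, empty_clause_id))


proof = {
    10: [1, 2],
    11: [3, 4],    # never used by the refutation
    12: [10, 3],
    13: [11, 5],   # never used by the refutation
    14: [12, 6],
    15: [14, 2],   # the empty clause
}
factor = pruning_factor(proof, 15)  # 6 derived clauses, 4 of them needed
```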

Lastly, to compare our approach at the largest scale with the state of the 
art in sequential solving, we computed speedups of mallob-cacld-p, solving
times only, over Kissat_MAB-HyWalk and arrived at a median speedup
of 11.5 (Appendix, Table 2). We also analyzed drat-trim checking times of
Kissat_MAB-HyWalk, kindly provided by the competition organizers, and arrived
at a median overhead of 1.1x its own solving time (Appendix, Table 3). Going by
these measures, Kissat_MAB-HyWalk takes around 11.5 · 2.1 ≈ 24.2x the solving
time of mallob-cacld-p to arrive at a checked result while our complete pipeline 
only takes 5.1x the solving time for the median instance. This indicates that 
our approach is considerably faster than the best available sequential solvers. 
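
The estimate above can be spelled out as a short calculation. The symbols $T_K$ and $T_M$, for the solving times of Kissat_MAB-HyWalk and mallob-cacld-p, are introduced here only for illustration, and combining two medians in this way is itself an approximation.

```latex
\begin{align*}
T_K &\approx 11.5\, T_M
  && \text{(median speedup)} \\
T_K^{\mathrm{checked}} &\approx (1 + 1.1)\, T_K
  && \text{(solving plus median checking overhead)} \\
 &= 2.1 \cdot 11.5\, T_M \approx 24.2\, T_M
\end{align*}
```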

We can see that the bottleneck of our pipeline shifts from the assembly step 
further to the postprocessing and checking steps when increasing the degree of 
parallelism. This is to be expected since the latter steps are, so far, inherently 
sequential whereas our proof assembly is scalable. While the postprocessing step 
is a technical necessity in our current setup, we believe that large portions of it 
can be eliminated in the future with further engineering. For instance, enhancing 
the LRAT support of our modified CaDiCaL to natively handle unit clauses in 
the input would allow us to skip preprocessing and simplify postprocessing. 


7 Conclusion and Future Work 


Distributed clause-sharing solvers are currently the fastest tools for solving a 
wide range of difficult SAT problems. Nevertheless, they have previously not 
supported proof-generation techniques, leading to potential soundness concerns. 
In this paper, we have examined mechanisms to add efficient support for proof 
generation to clause-sharing portfolio solvers. Our results demonstrate that we 
can, with reasonable efficiency, add support to these solvers to have full confi- 
dence that the results they produce are correct. 

Following our research, more work is required to reduce overhead in the 
different steps involved and to improve scalability of the end-to-end procedure. 
This may include designing more efficient (perhaps even parallel) LRAT checkers, 
examining proof-streaming techniques to eliminate most I/O operations, and 
improving LRAT support in solver backends. In fact, it might be possible to 
generalize our approach to DRAT-based solvers by adding additional metadata, 
and this might allow easier retrofitting of the approach onto larger portfolios of 
solvers. We also intend to investigate producing proofs in Mallob for the case 
where many problems are solved at once and jobs are rescaled dynamically [26]. 
Abstract. Proofs from SMT solvers ensure correctness independently
from implementation, which is often a requirement when solvers are used 
in safety-critical applications or proof assistants. Alethe is an established 
SMT proof format generated by the solvers veriT and cvc5, with recon- 
struction support in the proof assistants Isabelle/HOL and Coq. The for- 
mat is close to SMT-LIB and allows both coarse- and fine-grained steps, 
facilitating proof production. However, it lacks a stand-alone checker, 
which harms its usability and hinders its adoption. Moreover, the coarse- 
grained steps can be too expensive to check and lead to verification fail- 
ures. We present CARCARA, an independent proof checker and elaborator 
for Alethe, implemented in Rust. It aims to increase the adoption of the 
format by providing push-button proof-checking for Alethe proofs, focus- 
ing on efficiency and usability; and by providing elaboration for coarse- 
grained steps into fine-grained ones, increasing the potential success rate 
of checking Alethe proofs in performance-critical validators, such as proof 
assistants. We evaluate CARCARA over a large set of Alethe proofs gen- 
erated from SMT-LIB problems and show that it has good performance 
and its elaboration techniques can make proofs easier to check. 


1 Introduction 


Satisfiability modulo theories (SMT) solvers are widely used as background tools 
in various formal method applications, ranging from proof assistants to program 
verification [9]. Since these applications rely on the SMT solver results, they must 
trust their correctness. However, state-of-the-art SMT solvers are often found to 
have bugs, despite the best efforts of developers [30,38]. One way to address 
this issue is to formally verify the solvers’ correctness (“certifying” them), but 
this approach can be prohibitively expensive and time consuming, besides often 
requiring performance compromises [19, 20,27, 33] and increasing the evolution 
cost of the systems [14]. Alternatively, solvers can produce proofs: independently 
checkable certificates that justify the correctness of their results. Since proof 
checking generally has lower complexity than solving, small and trusted checkers 
can verify solver results in a scalable manner. Despite the successful adoption
of this approach by several SMT solvers [7,13,15,24,37], no standard SMT proof
format has emerged, with each system using its own format and independent
toolchain. The Alethe¹ format [35] for SMT proofs, however, has been emitted by
the veriT solver for several years [10] and recently² also by the cvc5 solver [7].
Moreover, Alethe proofs can be reconstructed within the proof assistants Coq [4,
16] and Isabelle/HOL [11,36], which allows leveraging solvers that support the
format (namely veriT and CVC4, the latter via a translator [16]) for automatic
theorem proving. In Isabelle/HOL in particular this integration has been very 
successful with the veriT solver, significantly increasing the success rate of the 
popular Sledgehammer tactic [36]. The format has been refined and extended 
through the years [6], being now mature and used by multiple systems, with 
support for core SMT theories, quantifiers, and pre-processing. It allows different 
levels of granularity, so that solvers can provide coarse-grained proofs (which are 
easier to produce), or take the effort to produce more detailed, fine-grained proofs 
(which are often easier to check). It provides a term language close to SMT- 
LIB [8], facilitating printing from solvers as well as validating the connection 
between proofs and the corresponding proved problems. An overview of the 
Alethe proof format is given in Section 2. 

A significant drawback of the Alethe format, however, is that it does not 
have an independent proof checker. This makes it harder for solvers to adopt 
the format, since to test their proof production they must be directly integrated 
with the proof assistants with Alethe reconstructions available. Moreover, these 
reconstruction methods do not check whether proof steps comply with the format’s
semantics, but rather are used as hints for internal tactics. Finally, the recon- 
struction techniques struggle with scalability due to well-known performance 
issues in the proof assistants [12,36]. 

In this paper we introduce CARCARA³ (Section 3), an independent proof
checker for Alethe proofs, implemented in a high-performance programming lan- 
guage, Rust. CARCARA is open-source and available under the Apache 2.0 license. 
Proof checking (Section 3.1) is performed by a collection of modules specific for 
each rule being checked. The presence of coarse-grained steps in Alethe requires
special handling in the checker to account for missing information, which is
discussed in detail. CARCARA also provides proof elaboration methods (Section 3.2)
for particularly impactful coarse-grained steps, so that they can be automati- 
cally translated, offline from the solver, into easier-to-check fine-grained steps. 
We evaluate (Section 4) CARCARA’s proof checking on a large set of proofs 
generated by veriT from SMT-LIB problems, analyzing its performance and ef- 
fectiveness. The same set of proofs is used to evaluate the proof elaboration 
methods, where we analyze how checking elaborated proofs compares with the 


1 The format was previously known as the “veriT format”, but it has recently been
renamed to reflect its independence from any individual solver.
2 cvc5’s support for Alethe is still experimental and is under active development.
CARCARA can actually be instrumental for improving cvc5’s support for Alethe.
3 We follow on the bird theme of the “Alethe” name. Carcará is the Portuguese word
for the crested caracara, a resourceful bird of prey native to South America.


An Efficient Proof Checker and Elaborator for Alethe Proofs


originals. Our analysis shows that CARCARA has performant proof checking and 
can identify wrong proofs produced by veriT. It also shows that elaboration can 
in some cases generate proofs significantly easier to check than the original ones. 


1.1 Related work 


CARCARA is inspired by the highly-successful DRAT-trim [23] proof checker 
for SAT proofs, which has been instrumental to the extensive usage of proofs 
in toolchains involving SAT solvers. It has also provided a basis for numerous 
advances in SAT proofs, with new proof formats and new checking techniques. 
We see its performant proof checking and elaboration techniques as the key 
elements to its success, serving both as an independent checker and as a bridge 
between solvers and performance-critical checkers, such as proof assistants or 
certified checkers. Providing both these features is the main goal of CARCARA. 

The checker for the Logical Framework with Side Conditions (LFSC) [37], an 
extension of Edinburgh’s Logical Framework (LF) [22], written in C++, is also a 
stand-alone, non-certified, highly efficient proof checker. The logical framework, 
where new rules can be mechanized in a language understood by the checker, 
provides great flexibility, and LFSC has been successfully used as a proof format 
for CVC4 [28] and cvc5 [5]. Similarly, Dedukti [25] is an OCaml checker for the
λΠ-calculus, another extension of LF, and has been applied to SMT proofs,
including to Alethe⁴. However, we are not aware of any mature implementation for
this end. Elaboration techniques have not been the focus in these tools. Another 
difference is that they are based on dependently-typed languages far-removed 
from SMT-LIB, and generating proofs from SMT solvers for them can be more 
challenging, as well as relating the resulting proofs to the original problems. 

An independent checker has been proposed for SMT proofs [34] from the 
OpenSMT [26] solver. The checker targets problems with uninterpreted func- 
tions and linear arithmetic, but does not support quantifiers nor pre-processing. 
It leverages DRAT-trim for the propositional reasoning and employs Python 
components for checking the other parts of the proof. Different components can 
use different proof formats, and to the best of our knowledge no comprehensive 
specification of the overall format is available. Some SMT solvers, such as SMT- 
Interpol [24] and cvc5 [7], have internal checkers for their proofs. Since these are 
not independent from the solvers, they are incomparable to our approach. 


2 The Alethe Proof Format 


Alethe was originally designed [10] as a proof-assistant friendly, easy-to-produce 
proof format for SMT solvers. A clear specification of the rules in a reference 
document [2] is provided, facilitating reconstruction within proof assistants by 
avoiding ambiguous syntax or semantics. To facilitate proof production, Alethe 
uses a term language that directly extends SMT-LIB, thus not requiring solvers 


4 “Verine” library available at https://deducteam.github.io/data/libraries/verine.tar.gz
to translate between different term languages when outputting proofs. More
importantly, Alethe’s proof calculus provides rules with varying levels of granularity,
allowing coarse-grained steps and relying on powerful proof checkers for filling
in the gaps. This reduces the burden on developers to track all reasoning steps 
performed by the solver, a notoriously difficult task [7]. The set of rules in the 
format captures SMT solving (as generally performed by CDCL(T)-based SMT
solvers [31]) for problems containing a mix of any of quantifiers, uninterpreted
functions, and linear arithmetic, as well as multiple pre-processing techniques. As
a testament to the format’s success, it has been refined and extended throughout
the years [6], and has been used as the basis for the integration, with the proof
assistants Isabelle/HOL and Coq, of the SMT solvers veriT [6,36], CVC4 [16]
and cvc5 [5, Sec. 3].

Here we briefly overview the Alethe proof format. For the full description of 
its syntax and semantics please see [2]. We assume the reader is familiar with 
basic notions of many-sorted equational first-order logic [17]. Alethe proofs have 
the form $\pi : \varphi_1 \wedge \cdots \wedge \varphi_n \rightarrow \bot$, i.e., they are refutations, where $\bot$ is derived
from assumptions $\varphi_1, \ldots, \varphi_n$ corresponding to the original SMT instance being
refuted. Proofs are a series of steps represented as an indexed list of step
commands. The command assume is analogous to step but used only for introducing
assumptions. The indexed steps induce a directed acyclic graph rooted
on the step concluding $\bot$ and with the assumptions $\varphi_1, \ldots, \varphi_n$ as leaves. Steps
represent inferences and abstractly have the form


$$c_1, \ldots, c_k \;\triangleright\; i.\quad \varphi_1, \ldots, \varphi_l \quad (\mathit{rule}\; p_1, \ldots, p_n)\; [a_1, \ldots, a_m]$$


where rule names the inference rule used in this step. Every step has an identifier
$i$ and concludes a clause, represented as a list of literals $\varphi_1, \ldots, \varphi_l$. The
premises are identifiers $p_1, \ldots, p_n$ of previous steps or assumptions, and rule-dependent
arguments are terms $a_1, \ldots, a_m$; steps may occur under a context,
which is defined by bound variables or substitutions $c_1, \ldots, c_k$. Contexts are introduced
by the anchor command, which opens subproofs. Subproofs simulate
the effect of the $\Rightarrow$-introduction rule of Natural Deduction, where local assumptions
are put in context and the last step in a subproof represents its conclusion
and the closing of its context. Besides arbitrary formulas, Alethe has support for 
contexts which put in scope bound variables and substitutions, which are useful 
for representing pre-processing techniques in the presence of binders [6], such as 
Skolemization, let elimination and alpha-conversion. 

The structure of Alethe proofs is motivated by SMT solvers generally oper- 
ating with a cooperation of a SAT solver and multiple engines to perform theory 
reasoning, deriving new facts and applying simplifications. The overall proof may 
be seen as a ground first-order resolution proof with theory lemmas justified by 
closed subproofs. Thus the emphasis on steps concluding clauses as term lists, 
which avoids ambiguity as to what clause a disjunction represents. An example 
is that whether a resolution step concluding the term $A \vee B$ corresponds to the
clause $[A, B]$ or $[A \vee B]$ depends on the premises. The use of identifiers for steps
allows representing proofs as directed acyclic graphs rather than trees. Similarly, 
(set-logic LIA)
(assert (forall ((x Int)) (> x 0)))
(assert (not (forall ((y Int)) (> y 0))))
(check-sat)


(assume h1 (forall ((x Int)) (> x 0)))
(assume h2 (not (forall ((y Int)) (> y 0))))
(anchor :step t3 :args ((y Int) (:= x y)))
(step t3.t1 (cl (= x y)) :rule refl)
(step t3.t2 (cl (= (> x 0) (> y 0))) :rule cong :premises (t3.t1))
(step t3 (cl (= (forall ((x Int)) (> x 0)) (forall ((y Int)) (> y 0))))
    :rule bind)
(step t4 (cl (not (forall ((x Int)) (> x 0))) (forall ((y Int)) (> y 0)))
    :rule equiv1 :premises (t3))
(step t5 (cl) :rule resolution :premises (t4 h1 h2))


Fig. 1: A simple SMT-LIB problem and an Alethe proof of its unsatisfiability.


term sharing can be achieved via the SMT-LIB :named attribute or define-fun 
commands [8, Sec. 4.1.6], which both allow naming subterms. These measures
are essential for compact representation of proofs, which can be prohibitively 
large otherwise. Explicitly providing the conclusion of proof steps aims to both 
facilitate proof checking (as it allows steps to be verified locally) and proof pro- 
duction, so coarse-grained rules that do not uniquely define their conclusions 
from premises and arguments can be effectively checked. 


Example 1. Figure 1 shows an SMT-LIB problem and an Alethe proof of its 
unsatisfiability. Note that in Alethe’s concrete syntax clauses are represented via 
the cl operator (the only exceptions are conclusions of assume commands, which
are considered unit clauses) and the context is not explicitly put in the steps, but 
rather assumed for all steps under (potentially nested) anchors introducing its 
elements. For this proof to be valid, three conditions need to be met: each assume 
command must correspond to an assert command in the original problem, 
every step command must be valid according to the semantics of its rule, and 
the proof must end with a step that concludes the empty clause (cl). The 
proof satisfies the first condition, as the terms in the assume commands are 
precisely the asserted terms in the SMT problem. The third condition holds as 
t5, the last step, concludes the empty clause. For the second condition, step t4 
is a direct consequence of the equivalence in its premise, t3, so it remains to 
check step t3, which is derived from a subproof. The anchor for t3 introduces a 
bound variable y and a substitution $\{x \mapsto y\}$. The steps in the subproof contain
terms with this new variable and operate under this substitution. The rule refl
models reflexivity modulo the cumulative, capture-avoiding substitution in the
(potentially nested) context, and thus t3.t1 holds since $x = y\{x \mapsto y\}$. Step
t3.t2 is regular congruence with the operator “>” and does not depend on the 
context. Finally, step t3 holds because its subproof shows the equivalence of the 
Fig. 2: Overview of the architecture of CARCARA.


bodies of the quantifiers under the renaming, introduced in the context, into a 
fresh variable relative to the left-hand side quantifier. Since all steps follow the 
expected semantics, all conditions are met and the proof is valid. 
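
To illustrate how a single step of Example 1 can be verified locally, the following Python sketch checks the final resolution step t5 from the conclusions of its premises t4, h1, and h2. It is a deliberately simplified chain resolution: the literal encoding, the function names, and the strategy of picking the first complementary pair are assumptions of this sketch and do not reproduce the exact semantics of Alethe's resolution rule.

```python
def negate(lit):
    # Atoms are strings; a negated atom is the pair ("not", atom).
    return lit[1] if isinstance(lit, tuple) and lit[0] == "not" else ("not", lit)


def resolve_chain(premises):
    """Resolve clauses left to right, using the first complementary pair found."""
    current = list(premises[0])
    for clause in premises[1:]:
        pivot = next((l for l in clause if negate(l) in current), None)
        if pivot is None:
            raise ValueError("no pivot found")
        # Drop the pivot pair and merge the remaining literals without duplicates.
        current = [l for l in current if l != negate(pivot)]
        current += [l for l in clause if l != pivot and l not in current]
    return current


A = "(forall ((x Int)) (> x 0))"
B = "(forall ((y Int)) (> y 0))"
t4 = [("not", A), B]   # conclusion of step t4
h1 = [A]               # assumption h1 as a unit clause
h2 = [("not", B)]      # assumption h2 as a unit clause
t5 = resolve_chain([t4, h1, h2])  # the empty clause, as claimed by step t5
```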


In the next section we show how CARCARA checks the above conditions, 
highlighting some challenging rules and showing how some coarse-grained steps 
are elaborated into proofs potentially simpler to check. 


3 Architecture and core components 


CARCARA is developed in the Rust programming language, and is publicly avail- 
able under the Apache 2.0 license. Its architecture is shown in Figure 2. It pro- 
vides both a command line interface and bindings for a Rust API. The main 
component is the proof checking one, with 6.5k LOC, which is a collection of 
procedures for each rule to be checked (Section 3.1). The elaborator has 1k 
LOC and has an interface to the cvc5 solver, as well as a collection of elabo- 
ration methods and a post-processing module to knit together the elaborated 
proof (Section 3.2). The other components together have 6k LOC, including a 
handwritten 2k LOC SMT-LIB and Alethe parser, and an Alethe printer. 

The inputs of CARCARA are an SMT-LIB problem $\varphi$ and an Alethe proof
$\pi : \varphi \rightarrow \bot$. In proof-checking mode it checks each step in $\pi$ with the respective
procedure for its rule and prints either valid, when all steps are successfully
checked and the proof concludes the empty clause (cl); holey, when $\pi$ is valid
but contains steps that are not checked (“holes”); or invalid otherwise, together
with an error message indicating the first step where checking failed and
why. In proof-elaboration mode it converts $\pi$ into $\pi' : \varphi \rightarrow \bot$, where some steps
may be replaced by a series of steps elaborating them, and prints $\pi'$.
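
The three possible outcomes of the proof-checking mode can be sketched as a small driver loop in Python. Everything concrete here is an assumption of the sketch, not CARCARA's actual API: the dictionary layout of steps, the checker signature, and the use of a rule named hole to mark unchecked steps.

```python
def check_proof(steps, checkers):
    """Return (verdict, message) with verdict in {"valid", "holey", "invalid"}.

    steps:    list of dicts with keys "id", "rule", and "clause" (a list)
    checkers: {rule name: function(step) -> bool}
    """
    holey = False
    for step in steps:
        if step["rule"] == "hole":
            holey = True  # accepted without checking, but remembered
            continue
        checker = checkers.get(step["rule"])
        if checker is None or not checker(step):
            # Report the first failing step, as the checker's output does.
            return "invalid", f"step {step['id']}: rule {step['rule']} failed"
    if not steps or steps[-1]["clause"] != []:
        return "invalid", "proof does not conclude the empty clause"
    return ("holey" if holey else "valid"), None
```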


5 https://github.com/ufmg-smite/carcara 
3.1 Checking Alethe proofs


First the original SMT-LIB problem and its Alethe proof are parsed. The prob- 
lem provides the declaration of sorts and symbols that may be used in the proof, 
as well as the original assertions, which must match the assumptions in the proof. 
Symbol definitions in the proof for term sharing are expanded during parsing. 
Terms are internally represented as directed acyclic graphs, using hash consing 
for maximal sharing and constant-time syntactic equality tests. The proof is
represented internally as an array of command objects, each corresponding ei- 
ther to an Alethe assume or step command, or a subproof, which is represented 
as a step with an (arbitrarily) nested array of command objects. Step identifiers 
are converted into indices for the arrays, so that access is constant-time. 
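
Hash consing, as used for the term representation above, can be illustrated with the following Python sketch. The class and method names are assumptions of this example; the point is only that structurally equal terms end up as one shared object, so that syntactic equality becomes an identity comparison.

```python
class TermPool:
    """Hash-consing pool: structurally equal terms share a single object."""

    def __init__(self):
        self._table = {}

    def mk(self, op, *args):
        # Arguments are already pooled, so their identities form a cheap key.
        # The table keeps every term alive, so the id() values stay stable.
        key = (op, tuple(id(a) for a in args))
        term = self._table.get(key)
        if term is None:
            term = (op, args)
            self._table[key] = term
        return term


pool = TermPool()
x, zero = pool.mk("x"), pool.mk("0")
t1 = pool.mk(">", x, zero)
t2 = pool.mk(">", pool.mk("x"), pool.mk("0"))
same = t1 is t2  # identity comparison instead of a structural traversal
```

Shared subterms are stored once, which turns the term tree into a directed acyclic graph and keeps large proofs compact in memory.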

Each command is checked individually by the rule checker corresponding to 
the rule in that command. That component takes as input the conclusion, the 
conclusions of its premises, and the arguments of the command, as well as the 
context it is in. As the Alethe format currently has 90 possible rules, CARCARA 
has 90 rule checkers. We highlight below some of the rule checkers as well as 
some challenges for checking Alethe proofs and how we addressed them. 


Term equality tests. Terms introduced by Alethe rules may have equality sub- 
terms implicitly reordered, but the rules are still valid if the conclusion changes 
only in this way. This flexibility is motivated by solvers often internally repre- 
senting equalities ignoring order, which may lead to equalities being implicitly 
reordered when appearing in facts derived by these components. The congruence 
closure procedure [29] commonly used in SMT is an example of such a compo- 
nent. Since equality symmetry justifies these reorderings, but keeping track of 
all the changes can be challenging, the format allows them to be implicit. 

As a consequence, syntactic equality cannot be the only test for whether two 
terms are the same. For example, the terms (and p (= a b)) and (and p (= 
b a)) may be required to be equal. Thus CARCARA tests equality in two phases: 
first if they are syntactically equal, in which case they can be compared in con- 
stant time; otherwise they are simultaneously traversed and equality subterms 
in the same position are compared modulo equality reordering, failing as soon as 
subterms differ. We refer to this as a polyequal test. As we will see in Section 4.1, 
these tests can be a substantial portion of overall checking time in some cases. 
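
The two-phase test can be sketched as follows, assuming terms are nested Python tuples of the form (head, arg1, ..., argn) with symbols as strings. This representation and the name polyeq are illustrative assumptions. With hash consing the first phase would be a constant-time identity check; here a plain structural comparison stands in for it.

```python
def polyeq(s, t):
    """Equality of terms modulo reordering of the arguments of '='."""
    # Phase 1: syntactic equality.
    if s == t:
        return True
    # Distinct symbols, or a symbol against a compound term, never match.
    if not (isinstance(s, tuple) and isinstance(t, tuple)):
        return False
    if s[0] != t[0] or len(s) != len(t):
        return False
    # Phase 2: simultaneous traversal; equalities may also match flipped.
    if s[0] == "=" and len(s) == 3:
        if polyeq(s[1], t[2]) and polyeq(s[2], t[1]):
            return True
    # all() stops at the first pair of subterms that differ.
    return all(polyeq(x, y) for x, y in zip(s[1:], t[1:]))
```

For the example above, polyeq(("and", "p", ("=", "a", "b")), ("and", "p", ("=", "b", "a"))) holds, whereas swapping the arguments of and is not accepted.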


Checking initial assumptions. The initial assume commands in an Alethe proof 
must correspond to assertions in the original problem, so their checker searches 
through the assertions to find a match. In general, this can be done efficiently: 
assertions are stored in a hash set during parsing, and these assume commands 
are valid if their conclusions occur in the set. However, assume commands are 
also impacted by implicit equality reordering, thus requiring polyequal tests. 
When an assumption does not occur in the assertions hash set, the checker 
attempts to match it to each assertion in turn, performing a polyequal test. 
As a consequence, when the original problem is large and the assertions similar 
and deep, checking assume steps may dominate overall checking time, as our 
experiments show (Section 4.1). 
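
A minimal sketch of this strategy, under the assumption that terms are nested tuples and that the assertions are available both as a hash set and as a list (the function names are hypothetical):

```python
def polyeq(s, t):
    """Term equality modulo reordering of the arguments of '='."""
    if s == t:
        return True
    if not (isinstance(s, tuple) and isinstance(t, tuple)):
        return False
    if s[0] != t[0] or len(s) != len(t):
        return False
    if s[0] == "=" and len(s) == 3 and polyeq(s[1], t[2]) and polyeq(s[2], t[1]):
        return True
    return all(polyeq(x, y) for x, y in zip(s[1:], t[1:]))


def check_assume(conclusion, assertion_set, assertions):
    # Fast path: the assumption occurs verbatim among the assertions.
    if conclusion in assertion_set:
        return True
    # Slow path: one polyequal test per assertion, in turn.
    return any(polyeq(conclusion, a) for a in assertions)


assertions = [("not", ("or", "p", ("=", "a", "b"))), ("or", "p", ("=", "b", "a"))]
assertion_set = set(assertions)
ok = check_assume(("or", "p", ("=", "a", "b")), assertion_set, assertions)
```

The slow path is linear in the number of assertions and, for each of them, potentially linear in the term size, which is consistent with the cost described above.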

Checking contextual steps. Steps within subproofs may depend on their context
to be valid, so before checking these steps, a context object is built based on the 
anchor opening the subproof. As shown in Section 2, context elements on which 
rules may depend are bound variables and substitutions. The former make new 
symbols available to build terms, while the latter allows steps to be valid modulo 
applying these substitutions. 

Substitutions in Alethe are capture-avoiding, renaming bound variables during
application, which facilitates producing proofs with binders [6]. However, this
has the side effect of also preventing constant-time equality tests, since we must
instead check $\alpha$-equivalence, i.e., a term with bound variables may be required to
be equal⁶ to the result of applying a substitution that may have renamed some
of these variables. To avoid spurious renaming when applying substitutions, the
checker only renames bound variables that occur as free variables in the substitution
range. Since computing free variables is itself costly, it is done lazily, only
when the substitution is to be applied under a binder, and the result is cached.
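
The renaming policy can be sketched as below. This is an illustrative model only: terms are nested tuples, ("forall", x, body) is the sole binder, every string is treated as a variable, and fresh names are built with a global counter. None of these choices is taken from CARCARA.

```python
import itertools

_fresh = itertools.count()  # shared counter for fresh variable names


def free_vars(t):
    if isinstance(t, str):
        return {t}
    if t[0] == "forall":
        return free_vars(t[2]) - {t[1]}
    return set().union(*(free_vars(a) for a in t[1:]))


class Substitution:
    def __init__(self, mapping):
        self.mapping = dict(mapping)
        self._range_fv = None  # computed lazily, then cached

    def range_free_vars(self):
        if self._range_fv is None:
            self._range_fv = set()
            for t in self.mapping.values():
                self._range_fv |= free_vars(t)
        return self._range_fv

    def apply(self, t):
        if isinstance(t, str):
            return self.mapping.get(t, t)
        if t[0] == "forall":
            _, x, body = t
            inner = {k: v for k, v in self.mapping.items() if k != x}
            # Rename only if x could capture a free variable of the range.
            if x in self.range_free_vars():
                fresh = f"{x}@{next(_fresh)}"
                inner[x] = fresh
                x = fresh
            return ("forall", x, Substitution(inner).apply(body))
        return (t[0],) + tuple(self.apply(a) for a in t[1:])
```

Applying {y ↦ x} to (forall x (P x y)) renames the bound x, whereas applying {y ↦ z} leaves the binder untouched.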

Note that, as subproofs can be nested, the substitution in context for a step
is the composition of a stack of substitutions $\sigma_1, \dots, \sigma_n$. To avoid sequential
application of substitutions, Alethe requires the substitution $\sigma$ in context to be
a cumulative substitution in which every term $t$ in the range of the substitution
$\sigma_{i+1}$ is replaced by $t\sigma_i$. Thus $\sigma$ can be applied simultaneously and corresponds to
a sequential application of $\sigma_1, \dots, \sigma_n$. As a result of these requirements, handling
and applying substitutions can be expensive in Alethe, as shown in Section 4.1.

Finally, the rules enclosing subproofs must be checked as to whether their
conclusions are valid given the introduced context and resulting subproof. For example,
the bind rule in Example 1 requires that the bound variable in the quantifier
at the right-hand side of the equality matches the range of the substitution put
in context for its subproof. The subproof rule, which introduces local assumptions
$\varphi_1, \dots, \varphi_n$ and concludes a formula $\neg\varphi'_1 \vee \dots \vee \neg\varphi'_n \vee \psi$, requires that the
enclosed subproof derives $\psi$ and that each $\varphi_i$ matches $\varphi'_i$.

We now highlight coarse-grained rules whose checking is more intricate and 
expensive. 


6 Since Alethe has bound-variable renaming rules, the checker requires names to be
handled properly, rather than normalizing all binders internally via De Bruijn indices.


Resolution. The rule resolution in Alethe captures hyper-resolution on ground
first-order clauses, i.e.,

$$\frac{C_1 \quad \cdots \quad C_n}{C}\ \text{resolution},\ p_1, p_2, \dots, p_{n-1}$$

where $C_1, \dots, C_n$ are premises; $p_i$ the pivot for the binary resolution between $C_i$
and $C_{i+1}$, occurring as is in $C_i$ and as $\neg p_i$ in $C_{i+1}$; and $C$ the conclusion. While
it is simple to check such steps, Alethe allows resolution steps to not provide
the pivots, for the sake of facilitating proof production in solvers. In general, checking
such steps requires searching for the pivots and for the binary resolution in which
each is to be used. CARCARA instead applies an incomplete heuristic where pivots are
inferred from the difference between the literals in the premises and in the conclusion
(i.e., literals not in the conclusion must have been pivots eventually eliminated). If
that fails, we apply a reverse unit propagation (RUP) test [21], i.e., the step is
valid if we can derive a conflict via Boolean Constraint Propagation from the
premises and the negated conclusion. Note that CARCARA also allows the pivots
to be provided as arguments, in which case checking is simple, as expected.
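
Both cases can be sketched in a few lines, assuming clauses are lists of non-zero integers in the DIMACS style (a negative integer denotes a negated atom) and that a pivot is given as the literal occurring in the clause derived so far. This encoding and the function names are assumptions made for illustration; Alethe itself works on terms.

```python
def resolve_with_pivots(premises, pivots):
    """Chain of binary resolutions with explicit pivots; None on failure."""
    current = set(premises[0])
    for clause, p in zip(premises[1:], pivots):
        if p not in current or -p not in clause:
            return None
        current = (current - {p}) | (set(clause) - {-p})
    return current


def rup_check(premises, conclusion):
    """Reverse unit propagation: negate the conclusion, propagate units."""
    true_lits = {-lit for lit in conclusion}
    changed = True
    while changed:
        changed = False
        for clause in premises:
            if any(lit in true_lits for lit in clause):
                continue  # clause already satisfied
            open_lits = {lit for lit in clause if -lit not in true_lits}
            if not open_lits:
                return True  # every literal is false: conflict
            if len(open_lits) == 1:
                true_lits.add(open_lits.pop())  # unit clause
                changed = True
    return False


# Atoms: 1 is (= a b), 2 is (= b c), 3 is (= c d), 4 is (= a d),
#        5 is (= b a), 6 is (= c b).
premises = [[-1, -2, -3, 4], [-5, 1], [-6, 2]]
conclusion = [-3, 4, -5, -6]
with_pivots = resolve_with_pivots(premises, [-1, -2])
by_rup = rup_check(premises, conclusion)
```

With pivots the check is a single left-to-right pass; the RUP test needs no pivots but may revisit the premises several times.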


AC simplification. Normalization modulo associativity and commutativity for 
conjunction and disjunction can be represented in Alethe via the ac_simp rule, 
which establishes the equality between a term t and a term t’ that is t but
with nested occurrences of these connectives flattened and duplicate arguments 
removed, until a fix-point. While this simplification is performance-critical [6, 
Sec. 4.6], checking the corresponding rule requires traversing t and performing 
the normalization, which is proportional to t’s depth. 
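
The normalization itself can be sketched as a single bottom-up pass over tuple-encoded terms. Collapsing a connective left with one argument to that argument is an assumption of this sketch, as is the name ac_simp for the function.

```python
def ac_simp(t):
    """Flatten nested and/or and drop duplicate arguments, bottom-up."""
    if isinstance(t, str):
        return t
    head = t[0]
    args = [ac_simp(a) for a in t[1:]]
    if head not in ("and", "or"):
        return (head,) + tuple(args)
    flat = []
    for a in args:
        # Children are already normalized, so one level of flattening suffices.
        if isinstance(a, tuple) and a[0] == head:
            flat.extend(a[1:])
        else:
            flat.append(a)
    unique = []
    for a in flat:
        if a not in unique:
            unique.append(a)
    if len(unique) == 1:  # assumption: (and x x) normalizes to x
        return unique[0]
    return (head,) + tuple(unique)
```

A checker for the rule can then compare the normalized left-hand side with the right-hand side of the concluded equality.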


Arithmetic reasoning. Apart from simplification rules, arithmetic reasoning in 
Alethe is mainly captured by two rules: la_generic and lia_generic. Both 
rules conclude a clause of negated linear inequalities, which is valid due to
Farkas’ lemma [18] guaranteeing that there exists a linear combination of these
inequalities equivalent to $\bot$. The la_generic rule takes as arguments the coefficients
of this linear combination, with which the rule can be checked by applying
simple (but costly) operations on the coefficients to reduce the linear combination
to $\bot$ (see [2, Sec 5.4, Rule 9] for the algorithm). The checker uses GMP [1]
for efficiently performing the required computations with the coefficients. 
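
The core of such a check can be sketched as follows. This is a simplified model rather than the algorithm of [2]: every inequality is assumed to be already normalized to the form "linear sum ≤ constant" or "linear sum < constant", and Python's exact Fraction type stands in for GMP.

```python
from fractions import Fraction


def check_farkas(inequalities, coefficients):
    """inequalities: list of (lhs, op, rhs) where lhs maps variables to
    coefficients, op is "<=" or "<", and rhs is a constant."""
    total = {}
    bound = Fraction(0)
    strict = False
    for (lhs, op, rhs), lam in zip(inequalities, coefficients):
        lam = Fraction(lam)
        if lam < 0:
            return False  # Farkas coefficients must be non-negative
        for var, c in lhs.items():
            total[var] = total.get(var, Fraction(0)) + lam * Fraction(c)
        bound += lam * Fraction(rhs)
        if op == "<" and lam > 0:
            strict = True
    # All variables must cancel, leaving "0 <= bound" or "0 < bound".
    if any(c != 0 for c in total.values()):
        return False
    return bound <= 0 if strict else bound < 0


# x <= 1 and x >= 2 (written as -x <= -2) are jointly unsatisfiable.
contradiction = check_farkas(
    [({"x": 1}, "<=", 1), ({"x": -1}, "<=", -2)], [1, 1]
)
```

If the check succeeds, the conjunction of the inequalities is unsatisfiable, so the clause of their negations is valid.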

While la_generic can be checked effectively, lia_generic cannot. It pro- 
vides only the negated inequalities, which would require searching for the coef- 
ficients to perform the checking, essentially requiring the arithmetic solving to 
be repeated in the checker. As a consequence this rule is considered a hole and 
CARCARA ignores it during proof checking, issuing a warning. 


3.2 Elaborating Alethe proofs 


In order to mitigate bottlenecks in checking some Alethe steps, CARCARA can 
also elaborate Alethe proofs into easier-to-check ones by filling in missing details 
from the original proofs. This is done by replacing coarse-grained steps with fine- 
grained proofs of their conclusions, producing a new overall proof equivalent to 
the original, but with some coarse-grained steps broken down into fine-grained 
ones. Formally, a proof such as the one below on the left, with a coarse step concluding
$\psi$ from premises $\psi_1, \dots, \psi_n$, is elaborated into the proof on the right, where the
coarse step is replaced by a proof $\pi$, with fine-grained steps, rooted on $\psi$ and
with $\psi_1, \dots, \psi_n$ as leaves:

$$
\dfrac{\dfrac{\psi_1 \quad \cdots \quad \psi_n}{\psi}\ \textsc{CoarseStep}}{\vdots}\ \textsc{Rule}
\qquad \rightsquigarrow_{\mathit{elab}} \qquad
\dfrac{\dfrac{\psi_1 \quad \cdots \quad \psi_n}{\psi}\ \pi}{\vdots}\ \textsc{Rule}
$$


(step t2.t1 (cl (not (= a b)) (not (= b c)) (not (= c d)) (= a d))
    :rule eq_transitive)
(step t2.t2 (cl (not (= b a)) (= a b)) :rule eq_symmetric)
(step t2.t3 (cl (not (= c b)) (= b c)) :rule eq_symmetric)
(step t2.t4 (cl (not (= c d)) (= a d) (not (= b a)) (not (= c b)))
    :rule resolution :premises (t2.t1 t2.t2 t2.t3))
(step t2 (cl (not (= b a)) (not (= c d)) (not (= c b)) (= a d))
    :rule reordering :premises (t2.t4))


Fig. 3: Elaboration of an eq_transitive step. Note the new eq_transitive step 
is easy to check, and the new t2 step has the same conclusion as the original. 


Note the expansion only affects the proof locally, since any step using the conclusion
of the coarse step as a premise may use the conclusion of $\pi$ interchangeably.

There are many Alethe rules whose checking would be simpler if elaborated, 
but we have focused initially on what we believe can be more impactful: removing 
implicit equality reordering, and thus polyequal tests, which affects virtually 
every Alethe rule; and providing checkable justifications for lia_generic steps, 
to remove holes from proofs. Before detailing these methods, we illustrate the 
elaboration process with an example. 


Elaborating transitivity steps. The eq_transitive rule concludes a valid clause 
composed of negated equalities followed by a single positive equality, such that 
the negated equalities form a transitive chain resulting in the final equality. 
However, the specification does not impose an order on the negated equalities 
(which can, remember, also be implicitly reordered). So the following step must 
also be valid, with a “shuffled” chain: 


(step t2 (cl (not (= b a)) (not (= c d)) (not (= c b)) (= a d))
:rule eq_transitive) 


This permissive specification again facilitates proof production (particularly 
from congruence closure procedures), but requires the eq_transitive checker, 
for every link in the chain, to potentially traverse the whole clause searching for 
the next one, performing polyequal tests throughout. The goal of elaborating 
eq_transitive steps is that steps like t2 are justified in a fine-grained manner. 
If we changed the conclusion of the step, this would impact the rest of the proof, 
if t2 is used anywhere as a premise. We instead introduce a fine-grained proof 
for t2’s conclusion, as shown in Figure 3: an easy-to-check eq_transitive step 
(t2.t1), eq_symmetric steps to flip the equalities (t2.t2, t2.t3), resolution 
(t2.t4) and reordering (t2) steps to derive the original conclusion.
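
The search for the ordered chain can be sketched with a greedy pass, assuming equalities are pairs of symbols and that the chain is a simple path (with repeated terms a backtracking search would be needed). The function name is hypothetical.

```python
def order_chain(equalities, conclusion):
    """Return the equalities as an oriented chain from the left-hand side of
    the conclusion to its right-hand side, or None if no chain is found."""
    start, goal = conclusion
    remaining = list(equalities)
    chain, current = [], start
    while remaining:
        for i, (x, y) in enumerate(remaining):
            if x == current:
                link = (x, y)
            elif y == current:
                link = (y, x)  # the equality has to be flipped
            else:
                continue
            chain.append(link)
            current = link[1]
            del remaining[i]
            break
        else:
            return None  # no remaining equality continues the chain
    return chain if current == goal else None


shuffled = [("b", "a"), ("c", "d"), ("c", "b")]
ordered = order_chain(shuffled, ("a", "d"))
```

For the shuffled step above the result is the chain a = b, b = c, c = d, which is the order used by step t2.t1 in Figure 3. The two flipped links correspond to the eq_symmetric steps.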


Elaborating implicit equality reordering. Similarly to above, steps concluding a 
term t, with some subterm equality implicitly reordered, have their conclusion 
replaced by t’ where that subterm is not reordered and a fine-grained proof of 
the conversion of t’ into t is added. Figure 4 illustrates this process for an assume
command; note that step h2.t1 is the rewriting justifying the equality
reordering of the subterm and the following steps rebuild the original conclusion.


(set-logic QF_UF)
(declare-const a Bool)
(declare-const b Bool)
(declare-const p Bool)
(assert (not (or p (= a b))))
(assert (or p (= b a)))
(check-sat)

Fig. 4a: An example SMT problem instance.


(assume h1 (not (or p (= a b))))
(assume h2 (or p (= a b)))
(step t3 (cl) :rule resolution :premises (h1 h2))

Fig. 4b: An Alethe proof for the SMT problem in Figure 4a. Notice that this
proof makes use of implicit reordering of equalities in h2.


(assume h1 (not (or p (= a b))))
(assume h2 (or p (= b a)))
(step h2.t1 (cl (= (= b a) (= a b))) :rule equiv_simplify)
(step h2.t2 (cl (= (or p (= b a)) (or p (= a b))))
    :rule cong :premises (h2.t1))
(step h2.t3 (cl (not (or p (= b a))) (or p (= a b)))
    :rule equiv1 :premises (h2.t2))
(step h2.t4 (cl (or p (= a b))) :rule resolution :premises (h2 h2.t3))
(step t3 (cl) :rule resolution :premises (h1 h2.t4))

Fig. 4c: The elaborated proof without implicit equality reordering.


Fig. 4: An example of the elaboration to remove implicit equality reordering.

In the original proof, the assume command h2 introduces the term (or p (= 
a b)), which is the original assertion (or p (= b a)) with the equality (= b 
a) implicitly reordered. In the elaborated proof (Figure 4c), the conclusion of 
h2 is replaced by one without implicit equality reordering, but step t3 expects 
the original conclusion. The steps h2.t1 to h2.t4 convert the new h2 conclusion 
into the original one, relying on standard equality reasoning and on resolution to 
connect the introduced steps. Notice that the t3 step, which originally referred
to h2 as a premise, now refers to h2.t4. 

When applied to every concluded term with implicit equality reordering,
the result of this elaboration method is a proof where equality tests are only 
syntactic, erasing the overhead of checking assumptions and polyequal tests. 


Elaborating lia_generic steps. As discussed in Section 3.1, CARCARA considers
lia_generic steps holes in the proof, as their checking is as hard as solving. Since 
our goal is to keep CARCARA as simple as possible, we rely on an external tool to 
elaborate the step by solving a problem corresponding to it in a proof-producing 
manner, then import the proof, checking it and guaranteeing that it is sound to 
replace the original step. Any tool producing detailed Alethe proofs for linear- 
integer arithmetic reasoning can be used to this end, but currently only cvc5 
can do so [7]. We note that cvc5 currently has the limitation that its Alethe
proofs may contain rewrite steps not yet modeled in the Alethe simplification
rules [2, Sec 5.11], which are thus not supported by CARCARA. They are considered
holes, but since these are generally simple simplification rules, they are much less
harmful than lia_generic ones.

In detail, the elaboration method, when encountering a lia_generic step
$S$ concluding the negated inequalities $\neg l_1 \vee \dots \vee \neg l_n$, generates an SMT-LIB
problem asserting $l_1 \wedge \dots \wedge l_n$ and invokes cvc5 on it, expecting an Alethe proof
$\pi : (l_1 \wedge \dots \wedge l_n) \to \bot$. CARCARA will check each step in $\pi$ and, if they are not
invalid, will replace step $S$ in the original proof by a proof of the form:


(anchor :step S.t_m+1)
(assume S.h_1 l1)
...
(assume S.h_n ln)
...
(step S.t_m (cl false) :rule ...)
(step S.t_m+1 (cl (not l1) ... (not ln) false) :rule subproof)
(step S.t_m+2 (cl (not false)) :rule false)
(step S (cl (not l1) ... (not ln))
    :rule resolution :premises (S.t_m+1 S.t_m+2))


where steps S.h_1 until S.t_m are imported from the cvc5 proof. As a result the 
lia_generic step S in the original proof will have been replaced by a detailed 
justification whose correctness can be independently established by CARCARA. 
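
Generating the auxiliary problem can be sketched as plain string construction. The helper below is hypothetical: it takes the needed declarations and the literals as SMT-LIB strings, asserts each literal separately (equivalent to asserting their conjunction), and assumes the QF_LIA logic. How CARCARA actually invokes cvc5, and with which options, is not shown here.

```python
def lia_step_to_problem(declarations, literals, logic="QF_LIA"):
    """Build an SMT-LIB script asserting l1, ..., ln for a step that
    concludes (cl (not l1) ... (not ln))."""
    lines = [
        "(set-option :produce-proofs true)",
        f"(set-logic {logic})",
    ]
    lines += declarations
    lines += [f"(assert {lit})" for lit in literals]
    lines += ["(check-sat)", "(get-proof)"]
    return "\n".join(lines)


script = lia_step_to_problem(
    ["(declare-const x Int)"],
    ["(> (* 2 x) 1)", "(< (* 2 x) 2)"],
)
```

The solver is expected to answer unsat; its proof can then be checked step by step and spliced into the subproof template shown above.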


4 Evaluation 


We evaluate CARCARA for proof-checking performance and the impact of elaboration
methods. We use the veriT solver [13], version 2021.06-40-rmx, to generate
Alethe proofs from all problems in the SMT-LIB benchmark library⁷ whose logic
it supports, with a 120-second timeout. We did not consider cvc5 as its support
for Alethe is not yet as mature or complete. The veriT solver produced 39,229
proofs. They total 92 GB, but vary greatly in size. The biggest proof has 4.5 GB,
fourteen have at least 1 GB and over a hundred have more than 100 MB, while
almost 90% are under 1 MB. All the experiments were run on a server equipped
with AWS Graviton2 2.5 GHz ARM CPUs, with 4 GB of memory for each job.


4.1 Proof checking 


We ran CARCARA on each proof until checking succeeded or failed. Only 378 had 
checking failures, which were due to incorrect⁸ steps for quantifier simplifications
(Skolemization and elimination of one-point quantifiers) and AC normalization. 
The issues have been communicated to the solver developers. For the success- 
ful proofs, a summary is given in Table 1, for each SMT-LIB logic, with the 
cumulative solving time by veriT and checking time by CARCARA. 


7 https://smtlib.cs.uiowa.edu/benchmarks.shtml
8 In a superficial analysis the steps seemed sound, but the proofs were incorrect.

Logic        Problems   Solving time (s)   Checking time (s)    Ratio
AUFLIA           2135            1094.67               12.51    87.53
AUFLIRA         19200             248.95              144.03     1.73
UF               2885            2858.14               30.95    92.35
UFIDL              55               0.54                0.66     0.82
UFLIA            7221            3547.78              136.21    26.05
UFLRA              10               0.02                0.01     3.05
QF_ALIA            16               0.79                1.39     0.57
QF_AUFLIA         256               0.34                0.11     3.04
QF_IDL            609            3316.08             2240.10     1.48
QF_LIA           1018            5975.36              742.73     8.05
QF_LRA            537            3629.39              258.60    14.03
QF_RDL             81             620.46              123.14     5.04
QF_UF            4180            3857.34             1881.55     2.05
QF_UFIDL           66             396.74               87.58     4.53
QF_UFLIA          167            1194.51                4.70   254.41
QF_UFLRA          415             141.82               65.14     2.18
Total           38851           26882.93             5729.39     4.69


Table 1: Total solving and proof-checking time per logic for veriT and CARCARA. 
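
The Ratio column appears to be the solving time divided by the checking time. The short computation below recomputes it for two rows from the rounded times shown in the table. For rows with very small times (e.g., UFLRA) the recomputed value differs from the printed one, presumably because the printed ratios were obtained from unrounded times.

```python
# (solving time, checking time) in seconds, as printed in Table 1.
rows = {
    "QF_IDL": (3316.08, 2240.10),
    "Total": (26882.93, 5729.39),
}

ratios = {logic: round(solve / check, 2) for logic, (solve, check) in rows.items()}
```

A ratio above 1 means that checking the proofs was cheaper overall than producing them.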


As expected, the comparison is heavily logic-dependent. In quantified logics
(top of the table), checking is generally significantly cheaper than solving.
An outlier is AUFLIRA, which is explained by the problems for which veriT
could produce proofs all being simple both to solve and to check. In logics such as
QF_UF and QF_IDL, which can have very large proofs, overall checking time is
comparable to solving time, if still noticeably smaller in total.

When comparing per-problem, for the large majority of proofs (81.61%) the
checking time was smaller than the solving time. Furthermore, for 3.96% of the
proofs, checking was more than 10 times faster than solving the problem, and
for 0.96%, it was more than 100 times faster. There were only 24 instances where the
checking time was more than 10 times larger than the solving time, and, in all
of them, the checking time was less than 0.6 seconds.

We also evaluate the per-rule frequency, as shown in Figure 5b, and checking 
time, with Figure 6a showing the cumulative checking times and Figure 5a a 
box plot considering individual rule checks. The lower whisker represents the 
5th percentile, the lower bound of the box represents the first quartile, the line 
inside the box represents the median, the upper bound of the box represents the 
third quartile, and the upper whisker represents the 95th percentile⁹. Rules that
are rare and have negligible checking time are omitted. The data is gathered 
from proof checking in all proofs, even those that failed. 

The assume commands account for a large proportion of the total time. 
This is justified by their checking, due to implicit equality reordering, being 
potentially proportional to both the quantity and the depth of assertions in the 
original problems. The box plot shows that the worst cases lead to the most
expensive rule checks among all rules.


9 The plots follow the same criteria as the evaluation in [36].

Fig. 5a: Box plot for checking time per rule.    Fig. 5b: [...] frequent shown).


Rules with highest overall time are resolution, ac_simp and la_generic. 
For resolution this is explained mainly by its high frequency (this is similarly 
the case for cong), as well as by some more expensive checks (veriT does not 
provide pivots), as shown in the box plot. As for ac_simp and la_generic, while 
they are much less frequent, their checking is expensive (Section 3.1). 

Other expensive rules to note are those related to contexts involving substitutions¹⁰,
especially let, for let elimination, and refl. It is common for let
subproofs to be deeply nested, leading to large cumulative substitutions needing
to be computed. As for refl, besides being one of the most frequent rules, about
a third of its total time is spent on polyequal tests, and most of the rest is related
to handling and applying substitutions, as well as checking alpha-equivalence.


4.2 Proof elaboration 


We ran CARCARA, on each successfully checked proof, in proof-elaboration mode 
with the elaboration of transitivity steps and, more importantly, the removal of 
implicit equality reordering. On average, excluding parsing, elaboration takes 
40% of the time required for checking. We focus on the impact on proof checking 
of the result of elaboration. 

In Figure 7 we have the comparison, per proof, of the proof-checking time on 
the original proof and on the elaborated one (excluding parsing time). There is 
not a clear winner, but note that for harder proofs (those originally requiring at 
least 1s), checking the elaborated proof is often significantly faster. A per-rule 
analysis is shown in Figure 6b, with the proportion of the checking time spent
in each rule, for the elaborated proofs. Compared to Figure 6a, the checking
time for assume steps becomes negligible in the elaborated proofs, as checking
them now amounts to checking occurrence in a hash set. The overall time for
refl also decreases, but only by 10%. This can be explained by the refl steps
added during elaboration. While checking each refl is now potentially cheaper,
this is offset by their increased number. Note that these additions also impact
other rules, especially cong, whose cumulative time increased by 13%. Overall,
proof elaboration resulted in a net improvement in checking time of 6%. Parsing
time, however, increased, which made the overall runtime for proof-checking the
original proofs virtually the same as for the elaborated proofs.


10 The ones shown in the plots are let, bind, sko_forall, and onepoint.

Fig. 6a: Total checking time per rule.    Fig. 6b: Times after elaboration.


The results indicate that elaborating implicit equality reordering is not
always worth it, especially for high-[...] for reconstructing some rules [36], which
this elaboration method would avoid.


Fig. 7: Before vs after elaboration (axis: elaborated time).


Elaborating lia_generic steps. In our benchmark set, 276 proofs contain a total
of 127k lia_generic steps. As a proof of concept we instrumented CARCARA to apply
the elaboration method described in Section 3.2 via a connection with cvc5¹¹. Due
to the still experimental Alethe proof production in cvc5, we only considered SMT
problems derived from lia_generic steps in proofs for the QF_UFLIA and QF_LIA
logics. This excluded only 15 proofs, each containing exactly one lia_generic step.
We ran CARCARA in proof-elaboration mode with a 30-minute timeout for each proof.
For each lia_generic step, cvc5 was invoked with a 30s timeout and the resulting
Alethe proof, if any, replaced the original lia_generic step, as described in
Section 3.2.


11 cvc5-1.0.2, modified for better Alethe support, provided by the cvc5 team.

Of the 261 proofs, CARCARA timed out on only 13 of them. Of the remaining
248 proofs, 82 still contained lia_generic steps after elaboration, either because
cvc5 timed out when solving the generated problem, or because the cvc5 proofs
contained lia_generic steps of their own. Note however that they are still
improvements over the original lia_generic steps, since generally fewer inequalities
are involved and the steps are potentially simpler to solve, were the process to
be repeated. Similarly, although all elaborated proofs contained holes from cvc5
rewriting steps, these are much simpler than the original lia_generic ones.

As with the elaboration of implicit equality reordering, this elaboration method 
would be particularly impactful in scenarios such as Alethe reconstruction in Is- 
abelle/HOL. Steps such as lia_generic are reconstructed via limited internal 
automation for arithmetic reasoning, which is known to fail [36, Sec. 4.3]. 


5 Conclusion and future work 


Our evaluation shows that CARCARA has good performance and can identify 
shortcomings in the proof-production of established SMT solvers. CARCARA can 
also elaborate proofs into demonstrably easier-to-check ones, which can have a 
significant impact, for example, if it is used as a bridge between solvers and proof 
assistants. Extending CARCARA to convert Alethe proofs into other formats 
would also allow the elaboration techniques to benefit other toolchains. 

As future work, we will add support for parallel proof checking, since steps 
in the same context can be checked completely independently. We will also add 
new elaboration methods for resolution and ac_simp, which occasionally are 
bottlenecks, and will provide elaboration for rewrite rules, which can change 
significantly between different solvers, complicating proof-production if solvers 
have to phrase their rewrites with a fixed set of rules. An automatic conversion 
into a defined set of rewrite rules, as described in [32], would address this issue. 

Finally, we expect CARCARA to facilitate improving how we use Alethe 
proofs. For example, our large-scale evaluation shows the significant time spent 
on contextual substitutions, which is mainly due to the Alethe requirement of 
only applying substitutions simultaneously. Extending the proof format to allow 
other substitution application strategies may be beneficial for different scenar- 
ios, as proof production in some solvers has indicated [7, Sec 5.1]. In general, 
extensions to the format (for example, to other logical theories) can be done in 
a more informed way with the help of an independent checker. 


Acknowledgments. We thank the reviewers for their helpful suggestions to improve 
this paper as well as CARCARA. We thank Hans-Jörg Schurr for his extensive work in 
detailing the semantics of Alethe, which greatly facilitated developing CARCARA.


Abstract. A packing k-coloring is a natural variation on the standard
notion of graph k-coloring, where vertices are assigned numbers from
{1,..., k}, and any two vertices assigned a common color c ∈ {1,..., k}
need to be at a distance greater than c (as opposed to 1, in standard
graph colorings). Despite a sequence of incremental work, determining
the packing chromatic number of the infinite square grid has remained 
an open problem since its introduction in 2002. We culminate the search 
by proving this number to be 15. We achieve this result by improving 
the best-known method for this problem by roughly two orders of mag- 
nitude. The most important technique to boost performance is a novel, 
surprisingly effective propositional encoding for packing colorings. Addi- 
tionally, we developed an alternative symmetry breaking method. Since 
both new techniques are more complex than existing techniques for this 
problem, a verified approach is required to trust them. We include both 
techniques in a proof of unsatisfiability, reducing the trusted core to the 
correctness of the direct encoding. 


Keywords: Packing coloring · SAT · Verification.


1 Introduction 


Automated reasoning techniques have been successfully applied to a variety of 
coloring problems ranging from the classical computer-assisted proof of the Four 
Color Theorem [1], to progress on the Hadwiger-Nelson problem [21], or im- 
proving the bounds on Ramsey-like numbers [19]. This article contributes a new 
success story to the area: we show the packing chromatic number of the infi- 
nite square grid to be 15, thus solving via automated reasoning techniques a 
combinatorial problem that had remained elusive for over 20 years. 

The notion of packing coloring was introduced in the seminal work of God- 
dard et al. [10], and since then more than 70 articles have studied it [3], estab- 
lishing it as an active area of research. Let us consider the following definition. 


Definition 1. A packing k-coloring of a simple undirected graph G = (V, E) is a
function f from V to {1,...,k} such that for any two distinct vertices u,v ∈ V,
and any color c ∈ {1,...,k}, it holds that f(u) = f(v) = c implies d(u,v) > c.
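
Definition 1 translates directly into a checker for finite graphs. The sketch below assumes the graph is given as an adjacency dictionary and computes distances by breadth-first search. The function names are illustrative.

```python
from collections import deque


def distances_from(adj, source):
    """Breadth-first search distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def is_packing_coloring(adj, f, k):
    """Check that f is a packing k-coloring of the graph adj."""
    for u in adj:
        if not 1 <= f[u] <= k:
            return False
        # Unreachable vertices are at infinite distance and never conflict.
        for v, d in distances_from(adj, u).items():
            if v != u and f[v] == f[u] and d <= f[u]:
                return False
    return True


# Path on four vertices: 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
good = is_packing_coloring(path, {0: 1, 1: 2, 2: 1, 3: 3}, 3)
bad = is_packing_coloring(path, {0: 1, 1: 2, 2: 1, 3: 2}, 3)
```

In the second coloring, vertices 1 and 3 both receive color 2 but are only at distance 2, so the condition d(u,v) > c fails.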


* Both authors are supported by the U.S. National Science Foundation under grant 
CCF-2015445.

Note that by changing the last condition to d(u,v) > 1 we recover the standard
notion of coloring, thus making packing colorings a natural variation of
them. Intuitively, in a packing coloring, larger colors forbid being reused in a 
larger region of the graph around them. Indeed, packing colorings were origi- 
nally presented under the name of broadcast coloring, motivated by the problem 
of assigning broadcast frequencies to radio stations in a non-conflicting way [10], 
where two radio stations that are assigned the same frequency need to be at 
distance greater than some function of the power of their broadcast signals. 
Therefore, a large color represents a powerful broadcast signal at a given fre- 
quency, that cannot be reused anywhere else within a large radius around it, 
to avoid interference. Minimizing the number of colors assigned can thus be in- 
terpreted as minimizing the pollution of the radio spectrum. The literature has 
preferred the name packing coloring ever since [3]. 

Analogously to the case of standard colorings, we can naturally define the 
notion of packing chromatic number, and study its computation. 


Definition 2. Given a graph G = (V,E), define its packing chromatic number
$\chi_p(G)$ as the minimum value k such that G admits a packing k-coloring.


Example 1. Consider the infinite graph with vertex set $\mathbb{Z}$ and with edges between
consecutive integers, which we denote as $\mathbb{Z}^1$. A packing 3-coloring is illustrated
in Figure 1. On the other hand, by examination one can observe that it is impossible
to obtain a packing 2-coloring for $\mathbb{Z}^1$.


Fig. 1: Illustration of a packing 3-coloring for $\mathbb{Z}^1$.
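
Both claims of Example 1 can be checked mechanically on finite windows of the integer line. The periodic pattern 1, 2, 1, 3 used below is one possible packing 3-coloring and is an assumption of this sketch (it is not necessarily the coloring drawn in Figure 1). The impossibility of a packing 2-coloring already shows on a path with four vertices, by exhaustive enumeration.

```python
from itertools import product


def valid_on_path(colors):
    """Packing condition on a path: equal colors c must be more than c apart."""
    n = len(colors)
    return all(
        colors[i] != colors[j] or j - i > colors[i]
        for i in range(n)
        for j in range(i + 1, n)
    )


pattern = [1, 2, 1, 3]
window = [pattern[i % 4] for i in range(40)]
three_colors_ok = valid_on_path(window)

# All 2^4 assignments of colors {1, 2} to a path with four vertices.
two_colorings = [c for c in product([1, 2], repeat=4) if valid_on_path(c)]
```

Since a finite subgraph without a packing 2-coloring rules one out for the whole line, the empty list two_colorings justifies the lower bound of 3.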


While Example 1 shows that $\chi_p(\mathbb{Z}^1) = 3$, the question of computing $\chi_p(\mathbb{Z}^2)$,
where $\mathbb{Z}^2$ is the graph with vertex set $\mathbb{Z} \times \mathbb{Z}$ and edges between orthogonally
adjacent points (i.e., points whose $\ell_1$ distance equals 1), has been open since the
introduction of packing colorings by Goddard et al. [10]. On the other hand, it
is known that $\chi_p(\mathbb{Z}^3) = \infty$ (again considering edges between points whose $\ell_1$
distance equals 1) [9]. The problem of computing $3 \le \chi_p(\mathbb{Z}^2) \le \infty$ has received
significant attention, and it is described as “the most attractive [of the packing
coloring problems over infinite graphs]” by Brešar et al. [3]. We can now state
our main theorem, providing a final answer to this problem.


Theorem 1. $\chi_p(\mathbb{Z}^2) = 15$.


An upper bound of 15 had already been proved by Martin et al. [18], who
found a packing 15-coloring of a 72 × 72 grid that can be used for periodically
tiling the entirety of $\mathbb{Z}^2$. Therefore, the main contribution of our work consists
of proving that 14 colors are not enough for $\mathbb{Z}^2$. Table 1 presents a summary of
the historical progress on computing $\chi_p(\mathbb{Z}^2)$. It is worth noting that amongst the
computer-generated proofs (i.e., all since Soukal and Holub [22] in 2010), ours is
the first one to be formally verified; see Section 4.


Table 1: Historical summary of the bounds known for $\chi_p(\mathbb{Z}^2)$.


Year  Citation                      Approach              Lower bound  Upper bound
2002  Goddard et al. [10]           Manual                          9           23
2002  Schwenk [20]                  Unknown                         9           22
2009  Fiala et al. [8]              Manual + Computer              10           23
2010  Soukal and Holub [22]         Simulated Annealing            10           17
2010  Ekstein et al. [7]            Brute Force Program            12           17
2015  Martin et al. [17]            SAT solver                     13           16
2017  Martin et al. [18]            SAT solver                     13           15
2022  Subercaseaux and Heule [23]   SAT solver                     14           15
2022  This article                  SAT solver                     15           15


For any $k \ge 4$, the problem of determining whether a graph G admits a
packing k-coloring is known to be NP-hard [10], and thus we do not expect a
polynomial time algorithm for computing $\chi_p(\cdot)$. This naturally motivates the use
of satisfiability (SAT) solvers for studying the packing chromatic number of finite
subgraphs of $\mathbb{Z}^2$. The rest of this article is thus devoted to proving Theorem 1
by using automated reasoning techniques, in a way that produces a proof that
can be checked independently and that has been checked by verified software.


2 Background 


We start by recapitulating the components used to obtain a lower bound of
14 in our previous work [23]. Naturally, in order to prove a lower bound for
$\mathbb{Z}^2$ one needs to prove a lower bound for a finite subgraph of it. As
in earlier work, we consider disks (i.e., 2-dimensional balls in the
$\ell_1$-metric) as the finite subgraphs to study [23]. Concretely, let $D_r(v)$
be the subgraph induced by $\{u \in V(\mathbb{Z}^2) \mid d(u,v) \le r\}$. To
simplify notation, we use $D_r$ as a shorthand for $D_r((0,0))$, and we let
$D_{r,k}$ be the instance consisting of deciding whether $D_r$ admits a packing
$k$-coloring. Moreover, let $D_{r,k,c}$ be the instance $D_{r,k}$ but enforcing
that the central vertex $(0,0)$ receives color $c$ (Fig. 2).

For example, a simple lemma of Subercaseaux and Heule [23, Proposition 5]
proves that the unsatisfiability of $D_{3,6,3}$ is enough to deduce that
$\chi_\rho(\mathbb{Z}^2) \ge 7$. We will prove a slight variation of it (Lemma 2)
later on in order to prove Theorem 1, but for now let us summarize how they
proved that $D_{12,13,12}$ is unsatisfiable.


Encodings. The direct encoding for $D_{r,k,c}$ consists simply of variables
$x_{v,t}$ stating that vertex $v$ gets color $t$, as well as the following clauses:


1. (at-least-one-color clauses, ALOC) $\bigvee_{t=1}^{k} x_{v,t}, \quad \forall v \in V$,
2. (at-most-one-distance clauses, AMOD)
   $\overline{x_{u,t}} \vee \overline{x_{v,t}}, \quad \forall t \in \{1,\ldots,k\},\ \forall u,v \in V \text{ s.t. } 0 < d(u,v) \le t$,
3. (center clause) $x_{(0,0),c}$.


Fig. 2: Illustration of satisfying assignments for $D_{3,7,3}$ and $D_{3,6,6}$.
On the other hand, $D_{3,6,3}$ is not satisfiable.


This amounts to $O(r^2 k^2)$ clauses [23]. The recursive encoding is significantly
more involved, but it leads to only $O(r^2 k \log k)$ clauses asymptotically.
Unfortunately, the constant involved in the asymptotic expression is large, and
this encoding did not give them practical speed-ups [23].
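As a rough illustration of the direct encoding, the following Python sketch generates its three clause families. The function names (`disk`, `direct_encoding`) and the DIMACS-style variable numbering are assumptions made for this example, not the authors' implementation.

```python
from itertools import combinations


def disk(r):
    """Vertices of D_r: integer points within l1-distance r of the origin."""
    return [(x, y) for x in range(-r, r + 1)
            for y in range(-r, r + 1) if abs(x) + abs(y) <= r]


def direct_encoding(r, k, c):
    """Return (num_vars, clauses) for D_{r,k,c}; clauses are lists of signed ints."""
    vertices = disk(r)
    index = {v: n for n, v in enumerate(vertices)}

    def var(v, t):
        # Hypothetical numbering of x_{v,t} for colors t = 1..k.
        return index[v] * k + t

    # ALOC: every vertex receives at least one color.
    clauses = [[var(v, t) for t in range(1, k + 1)] for v in vertices]
    # AMOD: u and v cannot share a color t whenever 0 < d(u, v) <= t.
    for u, v in combinations(vertices, 2):
        d = abs(u[0] - v[0]) + abs(u[1] - v[1])
        for t in range(d, k + 1):
            clauses.append([-var(u, t), -var(v, t)])
    # Center clause: the vertex (0, 0) gets color c.
    clauses.append([var((0, 0), c)])
    return len(vertices) * k, clauses


num_vars, clauses = direct_encoding(5, 10, 5)
print(num_vars, len(clauses))
```

Under these assumptions, the variable and clause counts for $D_{5,10,5}$ and $D_{6,11,6}$ agree with the direct-encoding column of Table 2.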


Cube And Conquer. Introduced by Heule et al. [13], the Cube And Conquer
approach aims to split a SAT instance $\varphi$ into multiple SAT instances
$\varphi_1, \ldots, \varphi_m$ in such a way that $\varphi$ is satisfiable if,
and only if, at least one of the instances $\varphi_i$ is satisfiable; thus
allowing to work on the different instances $\varphi_i$ in parallel.
If $\psi = (c_1 \vee c_2 \vee \cdots \vee c_m)$ is a tautological DNF, then we have

$$\mathrm{SAT}(\varphi) \iff \mathrm{SAT}(\varphi \wedge \psi) \iff \mathrm{SAT}\Big(\bigvee_{i=1}^{m} (\varphi \wedge c_i)\Big) \iff \mathrm{SAT}\Big(\bigvee_{i=1}^{m} \varphi_i\Big),$$

where the different $\varphi_i := (\varphi \wedge c_i)$ are the instances
resulting from the split.

Intuitively, each cube $c_i$ represents a case, i.e., an assumption about a
satisfying assignment to $\varphi$, and soundness comes from $\psi$ being a
tautology, which means that the split into cases is exhaustive. If the split is
well designed, then each $\varphi_i$ is a particular case that is substantially
easier to solve than $\varphi$, and thus solving them all in parallel can give
significant speed-ups, especially considering the sequential nature of CDCL, at
the core of most solvers. Our previous work [23] proposed a concrete algorithm
to generate a split, which already results in an almost linear speed-up, meaning
that by using 128 cores, the performance gain is roughly a ×60 factor.
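The equivalence behind the split can be checked on a toy formula. The sketch below is only an illustration under simplifying assumptions: it uses a brute-force satisfiability test instead of a CDCL solver, and the cubes are all sign patterns over a chosen set of variables (which trivially form a tautological DNF). The helper names are hypothetical.

```python
from itertools import product


def satisfiable(clauses, num_vars):
    """Brute-force SAT check; clauses are lists of signed integers."""
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in cl) for cl in clauses):
            return True
    return False


def split(clauses, split_vars):
    """Yield one sub-instance (phi AND c_i) per sign pattern c_i over split_vars."""
    for signs in product([1, -1], repeat=len(split_vars)):
        cube = [s * v for s, v in zip(signs, split_vars)]
        # A cube is a conjunction of literals, so it is added as unit clauses.
        yield clauses + [[lit] for lit in cube]


phi = [[1, 2], [-1, 3], [-2, -3], [-1, -2]]
whole = satisfiable(phi, 3)
by_cases = any(satisfiable(sub, 3) for sub in split(phi, [1, 2]))
assert whole == by_cases
```

Each sub-instance is independent of the others, which is what makes the cases suitable for parallel solving.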


Symmetry Breaking. The idea of symmetry breaking [6] consists of exploiting
the symmetries that are present in SAT instances to speed up computation. In
particular, $D_{r,k,c}$ instances have 3 axes of symmetry (i.e., vertical,
horizontal, and diagonal) which allowed for close to an 8-fold improvement in
performance for proving $D_{12,13,12}$ to be unsatisfiable. The particular use of
symmetry breaking in our previous approach [23] was happening at the Cube And
Conquer level, where out of the sub-instances $\varphi_1, \ldots, \varphi_m$
produced by the split, only a 1/8-fraction of them had to be solved, as the rest
were equivalent under isomorphism.


Verification. Arguably the biggest drawback of our previous approach proving
a lower bound of 14 is that it lacked the capability of generating a
computer-checkable proof. To claim a full solution to the 20-year-old problem of
computing $\chi_\rho(\mathbb{Z}^2)$ that is accepted by the mathematics
community, we deem paramount a fully verifiable proof that can be scrutinized
independently.

The most commonly-used proofs for SAT problems are expressed in the
DRAT clausal proof system [11]. A DRAT proof of unsatisfiability is a list of
clause addition and clause deletion steps. Formally, a clausal proof is a list of
pairs $(s_1, C_1), \ldots, (s_m, C_m)$, where for each $i \in \{1, \ldots, m\}$,
$s_i \in \{a, d\}$ and $C_i$ is a clause. If $s_i = a$, the pair is called an
addition, and if $s_i = d$, it is called a deletion. For a given input formula
$\varphi_0$, a clausal proof gives rise to a set of accumulated formulas
$\varphi_i$ ($i \in \{1, \ldots, m\}$) as follows:

$$\varphi_i := \begin{cases} \varphi_{i-1} \cup \{C_i\} & \text{if } s_i = a, \\ \varphi_{i-1} \setminus \{C_i\} & \text{if } s_i = d. \end{cases}$$


Each clause addition must preserve satisfiability, which is usually guaranteed
by requiring the added clauses to fulfill some efficiently decidable syntactic
criterion. The main purpose of deletions is to speed up proof checking by keeping
the accumulated formula small. A valid proof of unsatisfiability must end with
the addition of the empty clause.
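To make the accumulation of formulas concrete, here is a minimal Python sketch of a clausal proof replay. As a simplifying assumption it accepts an added clause only if it passes the reverse unit propagation (RUP) test, which is one efficiently decidable criterion but stricter than the full RAT check performed by DRAT checkers; the function names are hypothetical.

```python
def unit_propagate(clauses, assignment):
    """Extend `assignment` (a set of true literals) by unit propagation.
    Return False if some clause is falsified, True otherwise."""
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(l in assignment for l in clause):
                continue  # clause already satisfied
            free = [l for l in clause if -l not in assignment]
            if not free:
                return False  # every literal is false: conflict
            if len(free) == 1:
                assignment.add(free[0])
                changed = True
    return True


def check_proof(formula, proof):
    """Replay ('a' | 'd', clause) steps; accept only RUP additions."""
    current = [frozenset(c) for c in formula]
    for step, clause in proof:
        clause = frozenset(clause)
        if step == 'a':
            # RUP: assuming the negation of the clause must yield a conflict.
            if unit_propagate(current, {-l for l in clause}):
                return False
            current.append(clause)
        elif clause in current:
            current.remove(clause)  # deletion keeps the formula small
    # A proof of unsatisfiability must have added the empty clause.
    return frozenset() in current


formula = [[1, 2], [1, -2], [-1, 2], [-1, -2]]
proof = [('a', [1]), ('d', [1, 2]), ('a', [])]
print(check_proof(formula, proof))
```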


3 Optimizations 


Even with the best choice of parameters for our previous approach, solving the
instance $D_{12,13,12}$ takes almost two days of computation with a 128-core
machine [23]. In order to prove Theorem 1, we will need to solve an instance
roughly 100 times harder, and thus several optimizations will be needed. In fact,
we improve on all aspects discussed in Section 2; we present five different forms
of optimization that are key to the success of our approach, which we summarize
next.


1. We present a new encoding, which we call the plus encoding, that has
   conceptual similarities with the recursive encoding of Subercaseaux and
   Heule [23], while achieving a significant gain in practical efficiency.

2. We present a new split algorithm that works substantially better than the
   previous split algorithm when coupled with the plus encoding.

3. We improve on symmetry breaking by using multiple layers of
   symmetry-breaking clauses in a way that exploits the design of the split
   algorithm to increase performance.

4. We study the choice of color to fix at the center, showing that one can gain
   significantly in performance by making instance-based choices; for example,
   $D_{12,13,6}$ can be solved more than three times as fast as $D_{12,13,12}$
   (the instance used by Subercaseaux and Heule [23]).

5. We introduce a new and extremely simple kind of clause, called ALOD clauses,
   which improve performance when added to the other clauses of any encoding
   we have tested.


The following subsections present each of these components in detail. 


3.1 “Plus”: a New Encoding 


Despite the asymptotic improvement of the recursive encoding of Subercaseaux
and Heule [23], its contribution is mostly of “theoretical interest” as it does
not improve solution times. Nonetheless, that encoding suggests the possibility
of finding one that is both more succinct than the direct encoding and that
speeds up computation. Our path towards such an encoding starts with Bounded
Variable Addition (BVA) [16], a technique to automatically re-encode CNF
formulas by adding new variables, with the goal of minimizing their resulting
size (measured as the sum of the number of variables and the number of clauses).
BVA can significantly reduce the size of $D_{r,k,c}$ instances, even further than
the recursive encoding. Moreover, BVA actually speeds up computation when
solving the resulting instances with a CDCL solver, see Table 2. Figure 3
compares the number of AMOD clauses between the direct encoding and the BVA
encoding; for example in the direct encoding, for $D_{14}$ color 10 would require
roughly 30000 clauses, whereas it requires roughly 3500 in the BVA encoding. It
can be observed as well in Figure 3 that the direct encoding grows in a very
structured and predictable way, where color $c$ in $D_r$ requires roughly
$r^2 c^2$ clauses. On the other hand, arguably because of its locally greedy
nature, the results for BVA are far more erratic, and roughly follow a
$4 r^2 \lg c$ curve.

Fig. 3: Comparison of the size of the at-most-one-distance (AMOD) clauses
between the direct encoding and the BVA encoding, for $D_4$ up to $D_{14}$ and
colors $\{4, \ldots, 10\}$ (panels: direct encoding, bva encoding, plus encoding).


Table 2: Comparison between the different encodings. Cube And Conquer
experiments were performed with the approach of Subercaseaux and Heule [23]
(parameters F = 5, d = 2) on a 128-core machine. Hardware details in Section 5.

                                 direct encoding          bva encoding             plus encoding
                                 D_{5,10,5}  D_{6,11,6}   D_{5,10,5}  D_{6,11,6}   D_{5,10,5}  D_{6,11,6}
Number of variables              610         935          973         1559         673         1039
Number of clauses                10688       21086        2313        3928         4063        7548
CDCL runtime (s)                 255.12      10774.79     39.88       2539.38      15.90       811.66
Cube-and-conquer wall-clock (s)  0.77        26.20        0.78        17.97        0.50        6.68


The encoding resulting from BVA does not perform particularly well when
coupled with the split algorithm of Subercaseaux and Heule. Indeed, Table 2
shows that while BVA heavily improves runtime under sequential CDCL, it
does not provide a meaningful advantage when using Cube And Conquer.
Furthermore, encodings resulting from BVA are hardly interpretable, as BVA uses
a locally greedy strategy for introducing new variables. As a result, the design
of a split algorithm that could work well with BVA is a very complicated task.
Therefore, our approach consisted of reverse engineering what BVA was doing
over some example instances, and using that insight to design a new encoding
that produces instances of size comparable to those generated by BVA while
being easily interpretable and thus compatible with natural split algorithms.
By manually inspecting BVA encodings one can deduce that a fundamental
part of their structure is what we call regional variables/clauses. A regional
variable $r_{S,c}$ is associated with a set of vertices $S$ and a color $c$,
meaning that at least one vertex in $S$ receives color $c$. Let us illustrate
their use with an example.


Example 2. Consider the instance $D_{6,11,6}$, and let us focus on the
at-most-one-distance (AMOD) clauses for color 4. Figure 4a depicts two regional
clauses: one in orange (vertices labeled with $\alpha$), and one in blue
(vertices labeled with $\beta$), each consisting of 5 vertices organized in a
plus (+) shape. We thus introduce variables $r_{\text{orange},4}$ and
$r_{\text{blue},4}$, defined by the following clauses:


1. $\overline{r_{\text{orange},4}} \vee \bigvee_{v \text{ has label } \alpha} x_{v,4}$,
2. $\overline{r_{\text{blue},4}} \vee \bigvee_{v \text{ has label } \beta} x_{v,4}$,
3. $r_{\text{orange},4} \vee \overline{x_{v,4}}$, for each $v$ with label $\alpha$,
4. $r_{\text{blue},4} \vee \overline{x_{v,4}}$, for each $v$ with label $\beta$.


The benefit of introducing these two new variables and $2 + (5 \cdot 2) = 12$
additional clauses will be shown now, when using them to forbid conflicts more
compactly. Indeed, each vertex labeled with $\alpha$ or $\beta$ participates in
$|D_4| - 1 = 40$ AMOD clauses in the direct encoding, which equals a total of
$10 \cdot 40 - \binom{10}{2} = 355$ clauses for all of them (subtracting the
clauses counted twice). However, note that all 36 vertices shaded in light
orange are at distance at most 4 from all vertices labeled with $\alpha$, and
thus they are in conflict with $r_{\text{orange},4}$. This means that we can
encode all conflicts between $\alpha$-vertices and orange-shaded vertices with
36 clauses. The same can be done for $\beta$-vertices and the 36 vertices shaded
in light blue. Moreover, all pairs of vertices $(x, y)$ with $x$ being an
$\alpha$-vertex and $y$ being a $\beta$-vertex are in conflict, which we can
represent simply with the clause
$(\overline{r_{\text{orange},4}} \vee \overline{r_{\text{blue},4}})$, instead of
$5 \cdot 5 = 25$ pairwise clauses. We still need, however, to forbid that more
than one $\alpha$-vertex receives color 4, and the same for $\beta$-vertices,
which can be done by simply adding all $2 \cdot \binom{5}{2} = 20$ AMOD clauses
between all pairs. In total, the number of clauses involving $\alpha$ or $\beta$
vertices has gone down to $12 + 2 \cdot 36 + 20 + 1 = 105$ clauses, from the
original 355 clauses, by merely adding two new variables.


Fig. 4: Illustrations for $P_{6,11,6}$.
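The arithmetic of Example 2 can be re-derived with a few lines of Python. This is only a bookkeeping sketch: the region size (5) and the number of shaded vertices (36) are taken from the example as given, not recomputed from the geometry of Figure 4a, and the variable names are ours.

```python
from math import comb


def disk_size(r):
    """Number of vertices of D_r, i.e., 2r^2 + 2r + 1."""
    return 2 * r * r + 2 * r + 1


color = 4
region_size = 5   # vertices per plus-shaped region (from Example 2)
shaded = 36       # vertices in conflict with a whole region (from Example 2)
labeled = 2 * region_size

# Direct encoding: each labeled vertex is in |D_4| - 1 AMOD clauses for
# color 4; clauses between two labeled vertices are counted twice.
direct = labeled * (disk_size(color) - 1) - comb(labeled, 2)

# Regional version: defining clauses, region-vs-shaded conflicts,
# AMOD clauses inside each region, and one region-vs-region clause.
definitions = 2 + 2 * region_size
regional = definitions + 2 * shaded + 2 * comb(region_size, 2) + 1

print(direct, regional)
```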


As shown in Example 2, the use of regional clauses can make encodings more
compact, and this same idea scales even better for larger instances when the
regions are larger. A key challenge for designing a regional encoding in this
manner is that it requires a choice of regions (which can even be different for
every color). After trying several different strategies for defining regions, we
found one that works particularly well in practice (despite not yielding an
optimal number for the metric #variables + #clauses), which we denote the plus
encoding. The plus encoding is based on simply using “+” shaped regions (i.e.,
$D_1$) for all colors greater than 3, and not introducing any changes for colors
1, 2 and 3, as they only amount to a very small fraction of the total size of
the instances we consider. We denote with $P_{d,k,c}$ the plus encoding of the
diamond of size $d$ with $k$ colors, and the center being colored with $c$.
Figure 4b illustrates $P_{6,11,6}$. Interestingly, the BVA encoding opted for
larger regions for the larger colors, using for example $D_2$'s or $D_3$'s as
regions for color 14. We have experimentally found this to be very ineffective
when coupled with our split algorithms. In terms of the locations of the “+”
shaped regions, we have placed them manually through an interactive program,
arriving at the conclusion that the best choice of locations consists of packing
as many regions as possible and as densely around the center as possible. A more
formal presentation of all the clauses involved in the plus encoding is
presented in the extended arXiv version [24] of this paper, but all its
components have been illustrated in Example 2.

The exact number of clauses resulting from the plus encoding is hard to
analyze precisely, but it is clear that asymptotically it only improves from the
direct encoding by a constant multiplicative factor. Figure 3 and Table 2
illustrate the compactness of the plus encoding over particular instances, and
its increase in efficiency both for CDCL solving as well as with the Cube And
Conquer approach of Subercaseaux and Heule [23].


3.2 Symmetry Breaking 


Another improvement of our approach is a static symmetry-breaking technique,
whereas Subercaseaux and Heule [23] achieved symmetry breaking by discarding
all but 1/8 of the cubes. We cannot do this easily since the plus encoding does
not have an 8-fold symmetry. Instead it has a 4-fold symmetry (see Figure 4b).
We add symmetry-breaking clauses directly on top of the direct encoding (i.e.,
instead of using it after a Cube And Conquer split), as $D_{r,k,c}$ has indeed
an 8-fold symmetry (see Figure 5b). Concretely, if we consider a color $t$, it
can only appear once in $D_{\lfloor t/2 \rfloor}$, as if it appeared more than
once said appearances would be at distance $\le t$. Given this, we can assume
without loss of generality that if there is one appearance of $t$ in
$D_{\lfloor t/2 \rfloor}$, then it appears with coordinates $(a, b)$ such that
$a \ge 0 \wedge b \ge a$. We enforce this by adding negative units of the form
$\overline{x_{(i,j),t}}$ for every pair $(i,j) \in D_{\lfloor t/2 \rfloor}$ such
that $i < 0 \vee j < i$. This is illustrated in Figure 5b for $D_{5,10}$. Note
however that this can only be applied to a single color $t$, as when a vertex in
the north-north-east octant gets assigned color $t$, the 8-fold symmetry is
broken. However, if the symmetry-breaking clauses have been added for color $t$,
and yet $t$ does not appear in $D_{\lfloor t/2 \rfloor}$, then there is still an
8-fold symmetry in the encoding we can exploit by breaking symmetry on some
other color $t'$. This way, our encoding uses $L = 5$ layers of symmetry
breaking, for colors $k, k-1, \ldots, k-L+1$. At each layer $i$, where symmetry
breaking is done over color $k - i$, except for the first (i.e., $i > 0$), we
need to concatenate a clause

$$\mathrm{SymmetryBroken}_i := \bigvee_{t=k-i}^{k} \; \bigvee_{\substack{(a,b) \in D_{\lfloor t/2 \rfloor} \\ 0 \le a \le b}} x_{(a,b),t}$$

to each symmetry-breaking clause, so that symmetry breaking is applied only
when symmetry has not been broken already. Table 3 illustrates the impact of
this symmetry-breaking approach, yielding close to a ×40 speed-up for
$D_{6,11,6}$.
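The first layer of this construction can be sketched in Python as follows. The sketch only lists the vertices that receive a negative unit for a single color $t$; it does not implement the later layers or the SymmetryBroken clauses, and the function name is an assumption of this example.

```python
from itertools import combinations


def symmetry_breaking_units(t):
    """Return (region, forbidden): the vertices of D_{floor(t/2)} and those
    that get a negative unit for color t (i < 0 or j < i)."""
    r = t // 2
    region = [(i, j) for i in range(-r, r + 1) for j in range(-r, r + 1)
              if abs(i) + abs(j) <= r]
    forbidden = [(i, j) for (i, j) in region if i < 0 or j < i]
    return region, forbidden


region, forbidden = symmetry_breaking_units(10)

# Any two vertices of D_{floor(t/2)} are within distance t of each other,
# so color t can appear at most once in that region.
assert all(abs(a - c) + abs(b - d) <= 10
           for (a, b), (c, d) in combinations(region, 2))

allowed = [v for v in region if v not in forbidden]
print(len(region), len(forbidden), len(allowed))
```

The allowed vertices are exactly those with $0 \le a \le b$, i.e., one octant of the region including its boundary.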


3.3 At-Least-One-Distance clauses 


Yet another addition to our encoding is what we call At-Least-One-Distance
(ALOD) clauses, which consist of stating that, for every vertex $v$, if we
consider $D_1(v)$, then at least one vertex in $D_1(v)$ must get color 1.
Concretely, the At-Least-One-Distance clause corresponding to a vertex
$v = (i, j)$ is

$$C_v = x_{(i,j),1} \vee x_{(i+1,j),1} \vee x_{(i-1,j),1} \vee x_{(i,j+1),1} \vee x_{(i,j-1),1}.$$


Fig. 5: The effect of adding ALOD clauses (left) and symmetry-breaking (right).


Note that adding these clauses preserves satisfiability since they are blocked
clauses [15]; this can be seen as follows. If no vertex in $D_1(v)$ gets assigned
color 1, then we can simply assign $x_{v,1}$, thus satisfying the new clause $C_v$.

The purpose of ALOD clauses can be described as incentives towards assigning
color 1 in a chessboard pattern (see Figure 5a), which seems to simplify the rest
of the computation. Empirically, their addition improves runtimes; see Table 3.
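A minimal Python sketch of the ALOD clauses follows. It assumes that, for vertices on the boundary of $D_r$, the neighborhood $D_1(v)$ is truncated to the vertices that exist in $D_r$; literals are represented as (vertex, color) pairs, and the function names are hypothetical.

```python
def disk(r):
    """Vertices of D_r: integer points within l1-distance r of the origin."""
    return [(x, y) for x in range(-r, r + 1)
            for y in range(-r, r + 1) if abs(x) + abs(y) <= r]


def alod_clauses(r):
    """One clause per vertex v of D_r: some vertex of D_1(v) gets color 1."""
    vertices = set(disk(r))
    clauses = []
    for (i, j) in sorted(vertices):
        neighborhood = [(i, j), (i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)]
        # Assumption: neighbors outside D_r are simply dropped.
        clauses.append([(v, 1) for v in neighborhood if v in vertices])
    return clauses


clauses = alod_clauses(6)

# A chessboard assignment of color 1 (vertices with i + j even) satisfies
# every ALOD clause.
chessboard = {(i, j) for (i, j) in disk(6) if (i + j) % 2 == 0}
assert all(any(v in chessboard for (v, _) in clause) for clause in clauses)
print(len(clauses))
```

With one clause per vertex, $D_6$ yields 85 ALOD clauses, which is consistent with the difference between the clause counts with and without ALOD in Table 3 (21171 versus 21086).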


3.4 Cube And Conquer Using Auxiliary Variables 


The split of Subercaseaux and Heule [23] is based on cases about the $x_{v,c}$
variables of the direct encoding, and specifically using vertices $v$ that are
close to the center and colors $c$ that are in the top-$t$ colors for some
parameter $t$.

Our algorithm is instead based on cases only around the new regional
variables $r_{S,c}$, which appears to be key for exploiting their use in the
encoding.

More concretely, our algorithm, which we call PTR, is roughly based on
splitting the instance into cases according to which out of the $R$ regions that
are closest to the center get which of the $T$ highest colors (noting that a
region can get multiple colors). A third parameter $P$ indicates the maximum
number of positive literals in any cube of the split. More precisely, there are
cubes with $i$ positive literals for $i \in \{0, 1, \ldots, P-1, P\}$, and the
set of cubes with $i$ positive literals is constructed by PTR as follows:


1. Let $\mathcal{R}$ be the set of $R$ regions that are the closest to the
   center, and $\mathcal{T}$ the set consisting of the $T$ highest colors (i.e.,
   $\{k, k-1, \ldots, k-T+1\}$).

2. For each of the $R^i$ tuples $\vec{S} \in \mathcal{R}^i$, we create
   $\binom{T}{i}$ cubes as described in the next step.

3. For each subset $Q \subseteq \mathcal{T}$ with size $|Q| = i$, let
   $q_1, \ldots, q_i$ be its elements in increasing order, and then create a
   cube with positive literals $r_{\vec{S}_j, q_j}$ for $j \in \{1, \ldots, i\}$.
   Then, if $i < P$, add to the cube negative literals
   $\overline{r_{\vec{S}_j, q_\ell}}$ for $j \in \{1, \ldots, i\}$ and every
   $q_\ell \notin Q$.


Lemma 1. The cubes generated by the PTR algorithm form a tautology.

The proof of Lemma 1 is quite simple, and we refer the reader to the proof
of Lemma 7 in Subercaseaux and Heule [23] for a very similar one. Moreover,
because our goal is to have a verifiable proof, instead of relying on Lemma 1, we
test explicitly that the cubes generated by our algorithm form a tautology in all
the instances mentioned in this paper. Pseudo-code for PTR is presented in the
extended arXiv version of this paper [24].
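For intuition, an explicit tautology test for a small set of cubes can be written by enumerating all assignments, as in the Python sketch below. This brute-force check is exponential in the number of variables and is only meant as an illustration; it is not the procedure used for the millions of cubes in this paper, and the function name is hypothetical.

```python
from itertools import product


def is_tautology(cubes):
    """Check whether the DNF formed by `cubes` (lists of signed integers)
    is satisfied by every assignment of the variables it mentions."""
    variables = sorted({abs(l) for cube in cubes for l in cube})
    for bits in product([False, True], repeat=len(variables)):
        value = dict(zip(variables, bits))
        covered = any(all(value[abs(l)] == (l > 0) for l in cube)
                      for cube in cubes)
        if not covered:
            return False  # this assignment falls outside every case
    return True


print(is_tautology([[1, 2], [1, -2], [-1]]))
print(is_tautology([[1, 2], [-1]]))
```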


3.5 Optimizing the Center Color 


Our previous work [23] argued that for an instance $D_{r,k}$ one should fix the
color of the central vertex to $\min(r, k)$. However, our experiments suggest
otherwise. As the proof of Lemma 2 (in the extended arXiv version [24]) implies,
we are allowed to fix any color in the center, and as long as the resulting
instance is unsatisfiable, that will allow us to establish the same lower bound.
It turns out that the choice of the center color can dramatically affect
performance, as shown for instance $D_{12,13}$ (the one used to prove
$\chi_\rho(\mathbb{Z}^2) \ge 14$) in Figure 6. Interestingly, performance does
not change monotonically with the value fixed in the center. Intuitively, it
appears that fixing smaller colors in the center is ineffective as they impose
restrictions on a small region around the center, while fixing very large colors
in the center does not constrain the center much; for example, on the one hand,
fixing a 1 or 2 in the center does not seem to impose any serious constraints on
solutions. On the other hand, when a 12 is fixed in the center (as in our
previous work [23]), color 6 can be used 5 times in $D_6$, whereas if color 6 is
fixed in the center, it can only be used once in $D_6$. The apparent advantage
of fixing 12 in the center (that it cannot occur anywhere else in $D_{12,13}$)
is outweighed by the extra constraints around the center that fixing color 6
imposes; Subercaseaux and Heule already observed that most conflicts between
colors occur around the center [23], thus explaining why it makes sense to
optimize in that area.

The main result of Subercaseaux and Heule [23] is the unsatisfiability of
$D_{12,13,12}$, which required 45 CPU hours using the same SAT solver and similar
hardware. Let $P^\star_{d,k,c}$ denote $P_{d,k,c}$ with ALOD clauses and
symmetry-breaking predicates. We show unsatisfiability of $P^\star_{12,13,12}$ in
1.18 CPU hours and of $P^\star_{12,13,6}$ in 0.34 CPU hours. So the combination
of the plus encoding and the improved center reduces the computational costs by
two orders of magnitude.


Fig. 6: The impact of the color in the center ($c$) on the performance for
$P^\star_{12,13,c}$.

Fig. 7: Illustration of the verification pipeline.


4 Verification 


Our pipeline proves that, in order to trust $\chi_\rho(\mathbb{Z}^2) = 15$ as a
result, the only component that requires unverified trust is the direct encoding
of $D_{15,14,6}$. Indeed, let $P^\star_{15,14,6}$ be the instance $P_{15,14,6}$
with ALOD clauses and 5 layers of symmetry-breaking clauses, and let
$\psi = \{c_1, \ldots, c_m\}$ be the set of cubes generated by the PTR algorithm
with parameters $P = 6$, $T = 7$, $R = 9$. We then prove:


1. that $D_{15,14,6}$ is satisfiability equivalent to $P^\star_{15,14,6}$.

2. the DNF $\psi = c_1 \vee c_2 \vee \cdots \vee c_m$ is a tautology.

3. each instance $(P^\star_{15,14,6} \wedge c_i)$, for $c_i \in \psi$, is
   unsatisfiable.

4. hence the negation of each cube is implied by $P^\star_{15,14,6}$.

5. since $\psi$ is a tautology, its negation is unsatisfiable, and therefore so
   is $P^\star_{15,14,6}$.


As a result, Theorem 1 relies only on our implementation of $D_{15,14,6}$.
Fortunately, this is quite simple, and the whole implementation is presented in
the extended arXiv version of this paper [24]. Figure 7 illustrates the
verification pipeline, and the following paragraphs detail its different
components.


Symmetry Proof. The first part of the proof consists in the addition of
symmetry-breaking predicates to the formula. This part needs to go before the
re-encoding proof, because the plus encoding does not have the 8-fold symmetry
of the direct encoding. Each of the clauses in the symmetry-breaking predicates
has the substitution redundancy (SR) property [5]. This is a very strong
redundancy property and checking whether a clause $C$ has SR w.r.t. a formula
$\varphi$ is NP-complete. However, since we know the symmetry, it is easy to
compute an SR certificate. There exists no SR proof checker. Instead, we
implemented a prototype tool to convert SR proofs into DRAT, for which formally
verified checkers exist. Our conversion is similar to the approach used to
convert propagation redundancy into DRAT [12]. The conversion can significantly
increase the size of the proof, but the other proof parts are typically larger
for harder formulas, thus the size is acceptable.


Re-encoding Proof. After symmetry breaking, the formula encoding is optimized
by transforming the direct encoding into the plus encoding and adding the
ALOD clauses. This part of the proof is easy. All clauses in the plus encoding and
all ALOD clauses have the RAT redundancy property w.r.t. the direct encoding.
This means that we can add all these clauses with a single addition step per
clause. Afterward, the clauses that occur in the direct encoding but not in the
plus encoding are removed using deletion steps.


Implication Proof. The third part of the proof expresses that the formula
cannot be satisfied with any of the cubes from the split. For easy problems,
one can avoid splitting and just use the empty cube as tautological DNF. For
harder problems, splitting is crucial. We solve $D_{15,14,6}$ using a split with
just over 5 million cubes. Using a SAT solver to show that the formula with a
cube is unsatisfiable shows that the negation of the cube is implied by the
formula. We can derive all these implied clauses in parallel. The proofs of
unsatisfiability can be merged into a single implication proof.


Tautology Proof. The final proof part needs to show that the negations of the
clauses derived in the prior steps form a tautology. In most cases, including
ours, the cubes are constructed using a tree-based method. This makes the
tautology check easy, as there exists a resolution proof from the derived clauses
to the empty clause using $m - 1$ resolution steps, with $m$ denoting the number
of cubes. This part can be generated using a simple SAT call.
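The $m - 1$ bound can be illustrated with a short Python sketch that resolves sibling clauses of a binary split tree until the empty clause appears. It assumes the cubes are the leaves of such a tree (each internal node splits on one variable); it is an illustration of the counting argument, not the SAT call used in our pipeline, and the function name is hypothetical.

```python
def refute_tree_cubes(cubes):
    """Resolve the negated cubes of a binary split tree down to the empty
    clause and return the number of resolution steps used."""
    clauses = {frozenset(-l for l in cube) for cube in cubes}
    steps = 0
    while frozenset() not in clauses:
        # Find a clause whose sibling (same clause, one literal flipped) exists.
        pair = next(((c, l) for c in clauses for l in c
                     if ((c - {l}) | {-l}) in clauses), None)
        if pair is None:
            raise ValueError("cubes do not come from a binary split tree")
        c, l = pair
        sibling = (c - {l}) | {-l}
        # Replace the two siblings by their resolvent (the parent node).
        clauses -= {c, sibling}
        clauses.add(c - {l})
        steps += 1
    return steps


cubes = [[1, 2], [1, -2], [-1]]
print(refute_tree_cubes(cubes), len(cubes) - 1)
```

Each step replaces two clauses by one, so starting from $m$ clauses the empty clause is reached after $m - 1$ steps.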

The final proof merges all the proof parts. In case the proof parts are all in 
the DRAT format, such as our proof parts, then they can simply be merged by 
concatenating the proofs using the order presented above. 


5 Experiments 


Experimental Setup. In terms of hardware, all our experiments were run on
the Bridges2 [4] supercomputer. Each node has the following specifications: two
AMD EPYC 7742 CPUs, each with 64 cores, 256 MB of L3 cache, and 512 GB of
total RAM. Our code and various formulas are publicly available at the
repository https://github.com/bsubercaseaux/PackingChromaticTacas. In
terms of software, all sequential experiments were run on the state-of-the-art
solver CaDiCaL [2], while parallel experiments with Cube And Conquer were run
using a new implementation of parallel iCaDiCaL because it supports incremental
solving [13] while being significantly faster than iLingeling.


Effectiveness of the Optimizations. We evaluated the optimizations to the
direct encoding as proposed in Section 3: the plus encoding, the addition of the
ALOD clauses, and the new symmetry breaking. The results are shown in Table 3.
We picked $D_{6,11,6}$ for this evaluation since it is the largest diamond that
can still be solved within a couple of hours on a single core.

The main conclusion is that the optimizations significantly improve the
runtime. A comparison between the direct encoding without symmetry breaking and
the plus encoding with symmetry breaking and the ALOD clauses shows that the
latter can be solved roughly 200× faster. Table 3 shows all 8 possible
configurations. Turning on any of the optimizations always improves performance.
The effectiveness of the plus encoding and ALOD clauses is somewhat surprising:
the speed-up factor obtained by re-encoding typically does not exceed the factor
by which the formula size is reduced. In this case, the reduction factor in
formula size is less than 3, while the speed-up is larger than 13 (see the
difference between the first and second row of Table 3). Moreover, we are not
aware of prior reports of adding blocked clauses being effective; typically, SAT
solvers remove them.
We also constructed DRAT proofs of the optimizations (shown as derivation
in the table) and of the solver runs. We merged them into a single DRAT proof
by concatenating the files. The proofs were first checked with the drat-trim
tool, which produced LRAT proofs. These LRAT files were validated using the
formally-verified cake_lpr checker. The size of the DRAT proofs and the
checking time are shown in the table. Note that the checking time for the proofs
with symmetry breaking is always larger than the solving times. This is caused by
expressing the symmetry breaking in DRAT, resulting in a 436 Mb proof part.


The Implication Proof. The largest part of the computation consists of showing
that $P^\star_{15,14,6}$ is unsatisfiable under each of the 5,217,031 cubes
produced by the cube generator. The results of the experiments are shown in
Figure 8 (left). The left plot shows that roughly half of the cubes can be
solved in a second or less. The average runtime of cubes was 3.35 seconds, while
the hardest cube required 1584.61 seconds. The total runtime was 4851.38 CPU
hours.

For each cube, we produced a compressed DRAT proof (the default output of
CaDiCaL). Due to the lack of hints in DRAT proofs, they are somewhat complex
to validate using a formally-verified checker. Instead, we use the tool drat-trim
to trim the proofs and add hints. The results are uncompressed LRAT files, which
we validate using the formally-verified checker cake_lpr. The verification time
was 4336.93 CPU hours, so slightly less than the total runtime.

The sizes of each of the implication proofs show a similar distribution, as
depicted in Figure 8 (right). Most proofs are less than 10 MB in size. The
compressed DRAT proofs are generally smaller compared to the LRAT proofs,
but that is mostly due to compression, which reduces the size by around 70%.


Table 3: Evaluating the effectiveness of the optimizations on $D_{6,11,6}$.

sym  ALOD  plus  #var  #cls   runtime (s)  derivation  proof     check (s)
-    -     -     935   21086  10741.69     0 b         11.99 Gb  31731.20
-    -     x     1039  7548   809.65       149 Kb      1.29 Gb   1720.82
-    x     -     935   21171  8422.38      1.6 Kb      8.11 Gb   21732.74
-    x     x     1039  7633   389.71       151 Kb      1.29 Gb   1708.21
x    -     -     935   21286  273.19       436 Mb      0.63 Gb   1390.04
x    -     x     1039  7748   66.74        436 Mb      0.14 Gb   1022.42
x    x     -     935   21371  252.71       436 Mb      0.68 Gb   1359.05
x    x     x     1039  7833   55.56        436 Mb      0.10 Gb   997.90


Fig. 8: Cactus plot of solving and verification times in seconds (left) and cactus
plot of the size of the compressed DRAT proof and uncompressed LRAT proof
in Mb (right).


The Chessboard Conjecture and its Counterexample. Given that color
1 can be used to fill in 1/2 of $\mathbb{Z}^2$ in a packing coloring, and that
the packing colorings found in the past with 15, 16 or 17 colors used color 1
with density 1/2 in a chessboard pattern [18], it is tempting to assume that
this must always be the case. We thus conjectured that any instance $D_{r,k,c}$
is satisfiable if and only if it is satisfiable with the chessboard pattern. The
consequence of the conjecture is significant, as if it were true we could fix
half of the vertices to color 1, thus massively reducing the size of the
instance and its runtime. Unfortunately, this conjecture happens to be false,
with the smallest counterexample being $D_{14,14,6}$ as illustrated in Figure 9,
which deviates from the chessboard pattern in only 2 vertices. We have proved as
well that no solution for $D_{14,14,6}$ deviating in only 1 vertex from the
chessboard pattern exists.


Proving the Lower Bound. In order to prove Theorem 1, we require the
following 3 lemmas, from which the conclusion easily follows.


Lemma 2. If $D_{15,14,6}$ is unsatisfiable, then $\chi_\rho(\mathbb{Z}^2) \ge 15$.
Lemma 3. If $D_{15,14,6}$ is satisfiable, then $P^\star_{15,14,6}$ is also satisfiable.
Lemma 4. $P^\star_{15,14,6}$ is unsatisfiable.


We have obtained computational proofs of Lemma 3 and Lemma 4 as described
above, and thus it only remains to prove Lemma 2, which we include in
the appendix. We can thus proceed to our main proof.


Proof (of Theorem 1). Since Martin et al. proved that $\chi_\rho(\mathbb{Z}^2) \leq 15$ [18], it
remains to show $\chi_\rho(\mathbb{Z}^2) \geq 15$, which by Lemma 2 reduces to proving Lemma 3
and Lemma 4. We have proved these lemmas computationally, obtaining a single
DRAT proof as described in Section 4. The total solving time was 4851.31 CPU
hours, while the total checking time of the proofs was 4336.93 CPU hours. The
total size of the compressed DRAT proof is 34 terabytes, while the uncompressed
LRAT proof takes up 122 terabytes.

Fig. 9: A valid coloring of $D_{14,14,6}$. No valid coloring exists for this grid with a
full chessboard pattern of 1’s.


6 Concluding Remarks and Future Work 


We have proved $\chi_\rho(\mathbb{Z}^2) = 15$ by using several SAT-solving techniques, in what
constitutes a new success story for automated reasoning tools applied to
combinatorial problems. Moreover, we believe that several of our contributions in
this work might be applicable to other settings and problems. Indeed, we have
obtained a better encoding by reverse engineering BVA, and designed a split
algorithm that works well coupled with the new encoding; this experience
suggests split-encoding compatibility as a new key variable to pay attention to
when solving combinatorial problems under the Cube And Conquer paradigm.
As for future work, it is natural to study whether our techniques can be used to
improve other known bounds in the packing-coloring area (see, e.g., [3]), and
whether they apply to other families of coloring problems, such as distance
colorings [14].


Acknowledgements We thank the Pittsburgh Supercomputing Center for al- 
lowing us to use Bridges2 [4] in our experiments. We thank as well the anonymous 
reviewers for their comments and suggestions. We also thank Donald Knuth for 
his thorough comments and suggestions. The first author thanks the Facebook
group “actually good math problems”, where he first learned about this
problem, and in particular Dylan Pizzo for his post about it.


References 


1. Appel, K., Haken, W.: Every planar map is four colorable. Part I: Discharging.
Illinois Journal of Mathematics 21(3), 429-490 (1977)

2. Biere, A., Fazekas, K., Fleury, M., Heisinger, M.: CaDiCaL, Kissat, Paracooba,
Plingeling and Treengeling entering the SAT Competition 2020. In: Balyo, T.,
Froleyks, N., Heule, M., Iser, M., Järvisalo, M., Suda, M. (eds.) Proc. of SAT
Competition 2020 – Solver and Benchmark Descriptions. Department of Computer
Science Report Series B, vol. B-2020-1, pp. 51-53. University of Helsinki (2020)

3. Brešar, B., Ferme, J., Klavžar, S., Rall, D.F.: A survey on packing colorings.
Discussiones Mathematicae Graph Theory 40(4), 923 (2020)

4. Brown, S.T., Buitrago, P., Hanna, E., Sanielevici, S., Scibek, R., Nystrom, N.A.:
Bridges-2: A Platform for Rapidly-Evolving and Data Intensive Research, pp. 1-4.
Association for Computing Machinery, New York, NY, USA (2021)

5. Buss, S., Thapen, N.: DRAT proofs, propagation redundancy, and extended
resolution. In: Janota, M., Lynce, I. (eds.) Theory and Applications of Satisfiability
Testing – SAT 2019. pp. 71-89. Springer International Publishing, Cham (2019)

6. Crawford, J., Ginsberg, M., Luks, E., Roy, A.: Symmetry-breaking predicates for
search problems. In: Proc. KR’96, 5th Int. Conf. on Knowledge Representation and
Reasoning, pp. 148-159. Morgan Kaufmann (1996)

7. Ekstein, J., Fiala, J., Holub, P., Lidický, B.: The packing chromatic number of the
square lattice is at least 12. CoRR abs/1003.2291 (2010),
http://arxiv.org/abs/1003.2291

8. Fiala, J., Klavžar, S., Lidický, B.: The packing chromatic number of infinite product
graphs. Eur. J. Comb. 30(5), 1101-1113 (jul 2009)

9. Finbow, A.S., Rall, D.F.: On the packing chromatic number of some lattices.
Discrete Applied Mathematics 158(12), 1224-1228 (2010), traces from LAGOS’07 IV
Latin American Algorithms, Graphs, and Optimization Symposium Puerto Varas
- 2007

10. Goddard, W., Hedetniemi, S., Hedetniemi, S., Harris, J., Rall, D.: Broadcast
chromatic numbers of graphs. Ars Comb. 86 (01 2008)

11. Heule, M.J.H.: The DRAT format and drat-trim checker. CoRR abs/1610.06229
(2016), http://arxiv.org/abs/1610.06229

12. Heule, M.J.H., Biere, A.: What a difference a variable makes. In: Beyer, D.,
Huisman, M. (eds.) Tools and Algorithms for the Construction and Analysis of Systems.
pp. 75-92. Springer International Publishing, Cham (2018)

13. Heule, M.J.H., Kullmann, O., Wieringa, S., Biere, A.: Cube and conquer:
Guiding CDCL SAT solvers by lookaheads. In: Eder, K., Lourenço, J., Shehory, O.
(eds.) Hardware and Software: Verification and Testing. pp. 50-65. Springer Berlin
Heidelberg, Berlin, Heidelberg (2012)

14. Kramer, F., Kramer, H.: A survey on the distance-colouring of graphs. Discrete
Mathematics 308(2), 422-426 (2008)


15. Kullmann, O.: On a generalization of extended resolution. Discrete Applied
Mathematics 96-97, 149-176 (1999)

16. Manthey, N., Heule, M.J.H., Biere, A.: Automated reencoding of boolean formulas.
In: Proceedings of Haifa Verification Conference 2012 (2012)

17. Martin, B., Raimondi, F., Chen, T., Martin, J.: The packing chromatic number of
the infinite square lattice is less than or equal to 16 (2015),
http://arxiv.org/abs/1510.02374v1

18. Martin, B., Raimondi, F., Chen, T., Martin, J.: The packing chromatic number
of the infinite square lattice is between 13 and 15. Discrete Applied Mathematics
225, 136-142 (2017)

19. Neiman, D., Mackey, J., Heule, M.J.H.: Tighter bounds on directed Ramsey
number R(7). Graphs and Combinatorics 38(5), 156 (2022)

20. Schwenk, A.: private communication with Wayne Goddard. (2002)

21. Soifer, A.: The Hadwiger–Nelson Problem, pp. 439-457. Springer International
Publishing, Cham (2016)

22. Soukal, R., Holub, P.: A note on packing chromatic number of the square lattice.
The Electronic Journal of Combinatorics 17(1), #N17 (Mar 2010)

23. Subercaseaux, B., Heule, M.J.H.: The Packing Chromatic Number of the Infinite
Square Grid Is at Least 14. In: Meel, K.S., Strichman, O. (eds.) 25th International
Conference on Theory and Applications of Satisfiability Testing (SAT 2022).
Leibniz International Proceedings in Informatics (LIPIcs), vol. 236, pp. 21:1-21:16.
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany (2022)

24. Subercaseaux, B., Heule, M.J.H.: The packing chromatic number of the infinite
square grid is 15 (2023), https://arxiv.org/abs/2301.09757


Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium 
or format, as long as you give appropriate credit to the original author(s) and the 
source, provide a link to the Creative Commons license and indicate if changes were 
made. 

The images or other third party material in this chapter are included in the chapter’s 
Creative Commons license, unless indicated otherwise in a credit line to the material. If 
material is not included in the chapter’s Creative Commons license and your intended 
use is not permitted by statutory regulation or exceeds the permitted use, you will 
need to obtain permission directly from the copyright holder. 



Active Learning for SAT Solver Benchmarking 


Tobias Fuchs (✉), Jakob Bach, and Markus Iser


Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany 
info@tobiasfuchs.de, {jakob.bach,markus.iser}@kit.edu 


Abstract. Benchmarking is a crucial phase when developing algorithms. 
This also applies to solvers for the SAT (propositional satisfiability) prob- 
lem. Benchmark selection is about choosing representative problem in- 
stances that reliably discriminate solvers based on their runtime. In this 
paper, we present a dynamic benchmark selection approach based on 
active learning. Our approach predicts the rank of a new solver among 
its competitors with minimum runtime and maximum rank prediction 
accuracy. We evaluated this approach on the Anniversary Track dataset 
from the 2022 SAT Competition. Our selection approach can predict the 
rank of a new solver after about 10% of the time it would take to run 
the solver on all instances of this dataset, with a prediction accuracy 
of about 92%. We also discuss the importance of instance families in 
the selection process. Overall, our tool provides a reliable way for solver 
engineers to determine a new solver’s performance efficiently. 


Keywords: Propositional satisfiability · Benchmarking · Active learning


1 Introduction 


One of the main phases of algorithm engineering is benchmarking. This also ap- 
plies to propositional satisfiability (SAT), the archetypal NP-complete problem. 
Benchmarking is, however, quite expensive regarding the runtime of experiments. 
While benchmarking a single SAT solver might still be feasible, developing new, 
competitive SAT solvers requires extensive experimentation with a variety of 
ideas [8,2]. In particular, a new solver idea is rarely best on the first try. Thus, it
is highly desirable to reduce benchmarking time and discard unpromising ideas
early, which allows developers to test more approaches or spend more time on
promising ones. The field of SAT solver benchmarking is well established, but
traditional benchmark selection approaches do not optimize benchmark runtime.
Instead, they focus on selecting a representative set of instances for scoring
solvers [10,15]. For the latter, SAT Competitions typically employ the PAR-2
score, i.e., the average runtime with a penalty of $2\tau$ for timeouts with
time-limit $\tau$ [8].
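As a concrete illustration of PAR-2 scoring, the sketch below averages runtimes and charges each timeout twice the time limit. The function name `par2_score`, the example numbers, and the convention that a run counts as a timeout once its runtime reaches the limit are assumptions for illustration, not taken from the competition tooling.

```python
def par2_score(runtimes, time_limit):
    """Average runtime where each timeout is charged 2 * time_limit."""
    if not runtimes:
        raise ValueError("need at least one runtime")
    # Assumption: a runtime equal to (or above) the limit marks a timeout.
    penalized = [2 * time_limit if t >= time_limit else t for t in runtimes]
    return sum(penalized) / len(penalized)

# Two solved instances and one timeout at a hypothetical limit of 5000 s.
print(par2_score([100.0, 300.0, 5000.0], time_limit=5000.0))
```

A lower score is better, so a solver with a single timeout can score worse than one that solves every instance slowly.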

In this paper, we present a novel benchmark selection approach based on 
active learning. Our approach can predict the rank of a new solver with high ac- 
curacy in only a fraction of the time needed to evaluate the complete benchmark. 
Definition 1 specifies the problem we address. 


© The Author(s) 2023

Definition 1 (New-Solver Problem). Given solvers $\mathcal{A}$, instances $\mathcal{I}$,
runtimes $r \colon \mathcal{A} \times \mathcal{I} \to [0, \tau]$ with time-limit $\tau$, and a new solver
$\hat{a} \notin \mathcal{A}$, incrementally select benchmark instances from $\mathcal{I}$ to maximize the
confidence in predicting the rank of $\hat{a}$ while minimizing the total benchmark
runtime.


Note that our scenario assumes knowing the runtimes of all solvers, except 
the new one, on all instances. One could also imagine a collaborative filtering 
scenario, where runtimes are only partially known [23,25]. 

Our approach satisfies several desirable criteria for benchmarking: Rather 
than outputting a binary classification, i.e., whether the new solver is worse 
than an existing solver or not, we provide a scoring function that shows by which 
margin a solver is worse and how similar it is to existing solvers. In particular, 
our approach enables ranking the new solver amidst a set of existing solvers. 
For this ranking, we do not even need to predict exact solver runtimes, which 
is trickier. Further, we optimize the runtime that our strategy needs to arrive 
at its conclusion. We use instance and runtime features. Moreover, we select 
instances non-randomly and incrementally. In particular, we consider runtime 
information from already done experiments when choosing the next. By doing so, 
we can control the properties of the benchmarking approach, such as its required 
runtime. Our approach is scalable in that it ranks a new solver â among any 
number of known solvers A. In particular, we only subsample the benchmark 
once instead of comparing pairwise against each other solver [21]. 

We evaluate our approach with the SAT Competition 2022 Anniversary Track 
dataset [2], consisting of 5355 instances and runtimes of 28 solvers. We perform 
cross-validation by treating each solver once as the new solver and learning to 
predict the PAR-2 rank of that solver. On average, our predictions reach about 
92% accuracy with only about 10% of the runtime required to evaluate these 
solvers on the complete set of instances. 

Our entire source code¹ and experimental data² are available on GitHub.


2 Related Work 


Benchmarking is not only of high interest in many fields but also an active 
research area on its own. Recent studies show that benchmark selection is chal- 
lenging for multiple reasons. Biased benchmarks can easily lead to fallacious in- 
terpretations [7]. Benchmarking also has many interchangeable parts, such as the 
performance measures used, how measurement points are aggregated, and how 
missing values are handled. Questionable research practices could alter these ele- 
ments a-posteriori to meet expectations, thereby skewing the results [27]. In the 
following, we discuss related work from the areas of static benchmark selection, 
algorithm configuration, incremental benchmark selection, and active learning. 
Table 1 compares the most relevant approaches, which all pursue slightly differ- 
ent goals. Thus, our approach is not a general improvement over the others but 
the only one fully aligned with Definition 1. 


1 https://github.com/mathefuchs/al-for-sat-solver-benchmarking
2 https://github.com/mathefuchs/al-for-sat-solver-benchmarking-data

Table 1: Comparison of features of our benchmark-selection approach, the static
benchmark-selection approach by Hoos et al. [15], the algorithm configuration
system SMAC [16], and the active-learning approaches by Matricon et al. [21].

Feature                  Hoos [15]  SMAC [16]  Matricon [21]  Our approach
Ranking/Scoring          ✓          ✗          (✓)            ✓
Runtime Minimization     ✗          ✓          ✓              ✓
Incremental/Non-Random   ✗          ✗          ✓              ✓
Scalability              ✓          ✓          ✗              ✓


Static Benchmark Selection. Benchmark selection is essential for competi- 
tions, e.g., the SAT Competition. In such competitions, the organizers define 
the rules for composing the benchmarks. These selection strategies are primarily 
static, i.e., they do not depend on particular solvers to distinguish. Balint et al. 
provide an overview of benchmark-selection criteria in different solver
competitions [1]. Froleyks et al. describe benchmark selection in recent SAT
competitions [8]. Manthey and Möhle find that competition benchmarks might contain
redundant instances and propose a feature-based approach to remove
redundancy [20]. Mısır presents a feature-based approach to reduce benchmarks by
matrix factorization and clustering [24].

Hoos et al. [15] discuss which properties are most desirable when selecting 
SAT benchmark instances. The selection criteria are instance variety to avoid 
over-fitting, adapted instance hardness (not too easy but also not too hard), and 
avoiding duplicate instances. To filter out instances that are too similar, they
use a distance-based approach with the SATzilla features [37,38]. The approach
does not, however, optimize for benchmark runtime and selects instances randomly,
apart from constraints on the instance hardness and feature distance.


Algorithm Configuration. Further related work can be found within the field
of algorithm configuration [14,32], e.g., the configuration system SMAC [16].
There, the goal is to tune SAT solvers for a given sub-domain of problem
instances. Although this task differs from our goal, e.g., we do not need to
navigate the configuration space, there are similarities to our approach as well.
For example, SMAC also employs an iterative, model-based selection procedure,
though for configurations rather than instances. An algorithm configurator,
however, cannot be used to rank or score a new solver since algorithm configuration
solely seeks to find the best-performing configuration. Also, while SMAC uses a
model-based selection strategy to sample configurations, it selects instances
randomly, i.e., without building a model over instances.


Incremental Benchmark Selection. Matricon et al. present an incremental
benchmark selection approach [21]. Their per-set efficient algorithm selection
problem (PSEAS) is similar to our New-Solver Problem (cf. Definition 1). Given
a pair of SAT solvers, they iteratively select a subset of instances until the
desired confidence level is reached to decide which of the two solvers is better.
The selection of instances depends on the choice of the solvers to distinguish.
They calculate a scoring metric for all unselected instances, run the experiment
with the highest score, and update the confidence. Their approach ticks off most
of our desired features in Table 1. However, the approach only makes a binary
comparison between solvers rather than providing a scoring. Thus, it is unclear
how similar two given solvers are or on which instances they behave similarly.
Moreover, a significant shortcoming is the lack of scalability with the number
of solvers. Since the approach compares only pairs of solvers, evaluating a new
solver requires sampling a separate benchmark for each existing solver. In
contrast, our approach allows comparing a new solver against a set of existing
solvers by sampling only one benchmark.

Fig. 1: Types of machine learning (depiction inspired by Rubens et al. [29]).


Active Learning. Prediction models in passive machine learning are trained 
on datasets with given instance labels (cf. Fig. 1a). In contrast, active learn- 
ing (AL) starts with no or little labeled data. It repeatedly selects interesting 
problem instances for which to acquire labels, aiming to gradually improve the 
prediction model (cf. Fig. 1b). AL methods are especially beneficial if acquiring 
labels is computationally expensive, like obtaining solver runtimes. Without AL 
methods, it is not obvious which instances to label and which not. On the one 
hand, we want to maximize the utility an instance provides to our model, i.e., 
rank prediction accuracy, and on the other hand, minimize the cost, i.e., pre- 
dicted runtime, associated with the instance’s acquisition. Thus, we strive for an 
accurate prediction model without having to label every data point. 

Rubens et al. [29] survey active-learning advances. While synthesis-based AL
methods [5,9,34] generate instances for labeling, pool-based methods [11,13,19]
rely on a fixed set of unlabeled instances to sample from. Recent synthesis-based
methods within the field of SAT solving show how to generate problem instances
with desired properties [5,9]. This goal is, however, orthogonal to ours. While
those approaches want to generate instances on which a solver is good or bad,
we want to predict whether a solver is good or bad on an existing benchmark.
Volpato and Guangyan use pool-based AL to learn an instance-specific algorithm
selector [35]. Rather than benchmarking a solver’s overall performance, their goal
is to recommend the best solver out of a set of solvers for each SAT instance.

Algorithm 1: Incremental Benchmarking Framework

Input: Solvers $\mathcal{A}$, Instances $\mathcal{I}$, Runtimes $r \colon \mathcal{A} \times \mathcal{I} \to [0, \tau]$, Solver $\hat{a}$
Output: Predicted Score of $\hat{a}$, Measured Runtimes $R$

1  $M \leftarrow$ initModel($\mathcal{A}$, $\mathcal{I}$, $r$, $\hat{a}$)    // cf. Section 3.1
2  $R \leftarrow \emptyset$
3  while not stop($M$) do    // cf. Section 3.3
4      $e \leftarrow$ selectNextInstance($M$)    // cf. Section 3.2
5      $t \leftarrow$ runExperiment($\hat{a}$, $e$)    // Runs $\hat{a}$ on $e$ with timeout $\tau$
6      $R \leftarrow R \cup \{(e, t)\}$
7      updateModel($M$, $R$)    // cf. Section 3.1
8  $s_{\hat{a}} \leftarrow$ predictScore($M$)    // cf. Section 3.1
9  return ($s_{\hat{a}}$, $R$)


3 Active Learning for SAT Solver Benchmarking 


Algorithm 1 outlines our benchmarking framework. Given a set of solvers $\mathcal{A}$,
instances $\mathcal{I}$, and runtimes $r$, we first initialize a prediction model $M$ for the
new solver $\hat{a} \notin \mathcal{A}$ (Line 1). The prediction model $M$ is used to repeatedly
select an instance (Line 4) for benchmarking $\hat{a}$ (Line 5). The acquired result
is subsequently used to update the prediction model $M$ (Line 7). When the
stopping criterion is met (Line 3), we quit the benchmarking loop and predict
the final score of $\hat{a}$ (Line 8). Algorithm 1 returns the predicted score of $\hat{a}$ as
well as the acquired instances and runtime measurements (Line 9).

Section 3.1 describes the underlying prediction model M and specifies how 
we may derive a solver ranking from it. We discuss criteria for selecting instances 
in Section 3.2. Section 3.3 concludes with possible stopping conditions. 
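The loop of Algorithm 1 can be sketched in Python as a higher-order function. This is a minimal sketch under stated assumptions, not the authors' implementation: the function `incremental_benchmark`, its callable parameters, and the toy data are hypothetical, and unlike Algorithm 1 the stopping and selection callables also receive the measured runtimes.

```python
def incremental_benchmark(init_model, stop, select_next, run_experiment,
                          update_model, predict_score):
    """Generic benchmarking loop; all six callables are supplied by the caller."""
    model = init_model()
    measured = {}                       # instance -> runtime of the new solver
    while not stop(model, measured):
        instance = select_next(model, measured)
        measured[instance] = run_experiment(instance)
        model = update_model(model, measured)
    return predict_score(model), measured

# Toy instantiation: sample the two instances with the smallest estimated cost.
estimates = {"a": 5.0, "b": 1.0, "c": 3.0}      # hypothetical cost estimates
true_runtime = {"a": 4.0, "b": 2.0, "c": 6.0}   # hypothetical measurements

score, measured = incremental_benchmark(
    init_model=lambda: {},
    stop=lambda model, measured: len(measured) >= 2,
    select_next=lambda model, measured: min(
        (e for e in estimates if e not in measured), key=estimates.get),
    run_experiment=true_runtime.get,
    update_model=lambda model, measured: dict(measured),
    predict_score=lambda model: sum(model.values()) / len(model),
)
print(score, measured)
```

Passing the sub-routines as callables mirrors the way the framework is later instantiated with different selection and stopping strategies.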


3.1 Solver Model 


The model $M$ provides a runtime-label prediction function $f \colon \hat{\mathcal{A}} \times \mathcal{I} \to \mathbb{R}$ for
all solvers $\hat{\mathcal{A}} := \mathcal{A} \cup \{\hat{a}\}$. This prediction function powers instance selection
as described in Section 3.2. During model updates (Algorithm 1, Line 7), $f$ is
trained to predict a transformed version of the acquired runtimes $R$. We describe
the runtime transformation in the following paragraph. The features described in
Section 4.2 serve as the input to the model. Further, note that we build a new
prediction model in each iteration since the runtime of experiments (Line 5)
exceeds the runtime of model training by orders of magnitude. Finally, we predict
the score of the new solver $\hat{a}$ with the prediction function $f$ (Line 8).


Runtime Transformation. For the prediction model $M$, we transform the
real-valued runtimes into discrete runtime labels on a per-instance basis. For
each instance $e \in \mathcal{I}$, we use a clustering algorithm to assign the runtimes in
$\{r(a, e) \mid a \in \mathcal{A}\}$ to one of $k$ clusters $C_1, \dots, C_k$, such that the fastest runtimes
for the instance $e$ are in cluster $C_1$ and the slowest are in cluster $C_{k-1}$. Timeouts
$\tau$ always form a separate cluster $C_k$. The runtime transformation function
$\gamma_k \colon \mathcal{A} \times \mathcal{I} \to \{1, \dots, k\}$ is then specified as follows:

$$\gamma_k(a, e) = j \iff r(a, e) \in C_j$$

Given an instance $e \in \mathcal{I}$, a solver $a \in \mathcal{A}$ belongs to the $\gamma_k(a, e)$-fastest solvers on
instance $e$. In preliminary experiments, we achieved higher accuracy for
predicting such discrete runtime labels than for predicting raw runtimes. Research on
portfolio solvers has also shown that discretization works well in practice [4,26].
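The per-instance discretization can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes strictly positive runtimes, treats any runtime at or above the limit as a timeout, and realizes single-link clustering on logarithmic runtimes (the criterion named in Section 4.3) by cutting the sorted values at the largest gaps, which is equivalent for one-dimensional data. The function name `runtime_labels` is hypothetical.

```python
import math

def runtime_labels(runtimes, time_limit, k=3):
    """Map one instance's runtimes to labels 1..k (1 = fastest, k = timeout)."""
    solved = sorted({t for t in runtimes if t < time_limit})
    # 1-D single-link clustering: cut the sorted log-runtimes at the
    # (k - 2) largest gaps, which leaves at most k - 1 clusters.
    gaps = sorted(range(1, len(solved)),
                  key=lambda i: math.log(solved[i]) - math.log(solved[i - 1]),
                  reverse=True)
    cuts = set(gaps[:max(k - 2, 0)])
    label_of, label = {}, 1
    for i, t in enumerate(solved):
        if i in cuts:
            label += 1                  # a cut starts the next (slower) cluster
        label_of[t] = label
    return [k if t >= time_limit else label_of[t] for t in runtimes]

# Five hypothetical solvers on one instance with a 5000 s limit.
print(runtime_labels([1.0, 1.2, 50.0, 60.0, 5000.0], time_limit=5000.0))
```

Working in log space means that the gap between 1.2 s and 50 s counts for more than the gap between 50 s and 60 s, so fast and slow solvers end up in different clusters.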


Ranking Solvers. To determine solver ranks, we use the transformed runtimes
$\gamma_k(a, e)$ in the adapted scoring function $s_k \colon \mathcal{A} \to [1, 2 \cdot k]$ as follows:

$$s_k(a) := \frac{1}{|\mathcal{I}|} \sum_{e \in \mathcal{I}} \gamma'_k(a, e), \qquad \gamma'_k(a, e) := \begin{cases} 2 \cdot \gamma_k(a, e) & \text{if } \gamma_k(a, e) = k \\ \gamma_k(a, e) & \text{otherwise} \end{cases} \tag{1}$$

I.e., we apply PAR-2 scoring, which is commonly used in SAT competitions [8],
on the discrete labels. The scoring function $s_k$ induces a ranking among solvers.
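A small sketch of this label-based scoring, assuming that the timeout label k is counted twice and all other labels are counted once, as in PAR-2. The names `label_score` and `rank_solvers` and the example labels are hypothetical.

```python
def label_score(labels, k):
    """Average label over instances, counting the timeout label k twice."""
    if not labels:
        raise ValueError("need at least one label")
    return sum(2 * g if g == k else g for g in labels) / len(labels)

def rank_solvers(labels_by_solver, k):
    """Return solver names ordered from best (lowest score) to worst."""
    return sorted(labels_by_solver,
                  key=lambda a: label_score(labels_by_solver[a], k))

# Hypothetical labels of three solvers on three instances with k = 3.
labels = {"x": [1, 1, 3], "y": [1, 2, 2], "z": [3, 3, 3]}
print(rank_solvers(labels, k=3))
```

Solver "x" is fastest on two instances but is ranked behind "y" because its single timeout is penalized.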


3.2 Instance Selection 


Selecting an instance based on the model is a core functionality of our framework 
(cf. Algorithm 1, Line 4). In this section, we introduce two instance sampling 
strategies, one that minimizes uncertainty and one that maximizes information 
gain. Both strategies use the model’s label-prediction function $f$ and are
inspired by existing work within the realms of active learning [30]. These methods
require the model’s predictions to include probabilities for the $k$ discrete runtime
labels. Let $f' \colon \hat{\mathcal{A}} \times \mathcal{I} \to [0,1]^k$ denote this modified prediction function. In the
following, the set $\tilde{\mathcal{I}} \subseteq \mathcal{I}$ denotes the instances that have already been sampled.


Uncertainty Sampling. The uncertainty sampling strategy selects the
instance closest to the model’s decision boundary, i.e., we select the instance
$e \in \mathcal{I} \setminus \tilde{\mathcal{I}}$ that minimizes $U(e)$, which is specified as follows:

$$U(e) := \left| \frac{1}{k} - \max_{n \in \{1, \dots, k\}} f'(\hat{a}, e)_n \right|$$
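The selection rule can be illustrated with a few lines of Python, assuming that U(e) is the absolute difference between 1/k and the largest predicted label probability for the new solver. The function names and the probability vectors are hypothetical.

```python
def uncertainty(probabilities):
    """U(e) = |1/k - max_n p_n| for one instance's k label probabilities."""
    k = len(probabilities)
    return abs(1.0 / k - max(probabilities))

def select_most_uncertain(predicted, sampled):
    """Pick the not-yet-sampled instance with the smallest U(e)."""
    candidates = [e for e in predicted if e not in sampled]
    return min(candidates, key=lambda e: uncertainty(predicted[e]))

# Hypothetical label probabilities for the new solver on three instances.
predicted = {
    "e1": [0.90, 0.05, 0.05],   # confident prediction, large U
    "e2": [0.40, 0.30, 0.30],   # close to uniform, small U
    "e3": [0.34, 0.33, 0.33],   # most uncertain, but already sampled
}
print(select_most_uncertain(predicted, sampled={"e3"}))
```

A value of U(e) near zero means that the model spreads its probability almost uniformly over the labels, so measuring that instance is expected to be most informative.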


Information-Gain Sampling. The information-gain sampling strategy selects
the instance with the highest expected entropy reduction regarding the runtime
labels of the instance. To be more specific, we select the instance $e \in \mathcal{I} \setminus \tilde{\mathcal{I}}$ that
maximizes $IG(e)$, which is specified as follows:

$$IG(e) := H(e) - \sum_{n=1}^{k} f'(\hat{a}, e)_n \cdot H_n(e)$$

Here, $H(e)$ denotes the entropy of the runtime labels $\gamma_k(a, e)$ over all $a \in \mathcal{A}$ and
$H_n(e)$ denotes the entropy of these labels plus $n$ as the runtime label for $\hat{a}$.
The term $H_n(e)$ is computed for every possible runtime label $n \in \{1, \dots, k\}$.
By maximizing information gain, we select instances that identify solvers with
similar behavior.
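The following sketch computes an expected entropy reduction from the quantities described above. It assumes that the expectation weights each hypothetical label n of the new solver by its predicted probability, and it uses Shannon entropy in bits; the function names and inputs are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(known_labels, probabilities):
    """Entropy of the known labels minus the expected entropy after adding
    the new solver's label n (n = 1..k) with its predicted probability."""
    expected = sum(p * entropy(known_labels + [n])
                   for n, p in enumerate(probabilities, start=1))
    return entropy(known_labels) - expected

# Labels of four known solvers on one instance and the new solver's
# predicted label probabilities (k = 3), both hypothetical.
print(information_gain([1, 1, 2, 2], [0.5, 0.5, 0.0]))
```

The instance with the largest value among the unsampled instances would be selected next.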


3.3 Stopping Criteria 


In this section, we present the two dynamic stopping criteria in our experiments, 
the Wilcoxon and the ranking stopping criterion (cf. Algorithm 1, Line 3). 


Wilcoxon Stopping Criterion. The Wilcoxon stopping criterion stops the
active-learning process when we are confident enough that the predicted
runtime labels of the new solver are sufficiently different from existing solvers. This
criterion is loosely inspired by Matricon et al. [21]. We use the average p-value
$W_{\hat{a}}$ of a Wilcoxon signed-rank test $w(S, P)$ of the two runtime label distributions
$S = \{\gamma_k(a, e) \mid e \in \mathcal{I}\}$ for an existing solver $a$ and $P = \{f(\hat{a}, e) \mid e \in \mathcal{I}\}$ for the
new solver $\hat{a}$:

$$W_{\hat{a}} := \frac{1}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} w(S, P)$$

To improve the stability of this criterion, we use an exponential moving average
to smooth out outliers and stop as soon as $W_{\exp}$ drops below a fixed threshold:

$$W_{\exp}^{(0)} := 1, \qquad W_{\exp}^{(i)} := \beta W_{\hat{a}} + (1 - \beta) W_{\exp}^{(i-1)}$$
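The smoothing step can be sketched independently of the statistical test. The code below assumes that the average p-value of each iteration is already available (it could come, for instance, from `scipy.stats.wilcoxon`), starts the moving average at 1, and stops once the smoothed value falls below the threshold. The function names are hypothetical; the defaults of 0.7 for the smoothing factor and 5% for the threshold are values from the hyper-parameter list in Section 4.3.

```python
def smoothed_p_values(average_p_values, beta):
    """Exponential moving average of per-iteration p-values, starting at 1."""
    smoothed, current = [], 1.0
    for p in average_p_values:
        current = beta * p + (1.0 - beta) * current
        smoothed.append(current)
    return smoothed

def first_stop_iteration(average_p_values, beta=0.7, threshold=0.05):
    """1-based iteration at which the smoothed value drops below the
    threshold, or None if it never does."""
    values = smoothed_p_values(average_p_values, beta)
    for i, value in enumerate(values, start=1):
        if value < threshold:
            return i
    return None

# Hypothetical average p-values observed over four iterations.
print(first_stop_iteration([0.5, 0.01, 0.01, 0.01]))
```

Because the average starts at 1, a single small p-value early on does not trigger a stop; several consecutive small values are needed.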


Ranking Stopping Criterion. The ranking stopping criterion is less
sophisticated in comparison. It stops the active-learning process if the ranking induced
by the model’s predictions (Equation 1) has remained unchanged within the last $l$
iterations. However, the concrete values of the predicted score $s_{\hat{a}}$ might still
change. In this case, we are solely interested in the induced ranking.
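A minimal sketch of this criterion, assuming that the predicted rank of the new solver is recorded after every iteration and that "unchanged within the last l iterations" means the last l + 1 recorded ranks are identical. The function name `ranking_converged` and its parameters are hypothetical.

```python
def ranking_converged(rank_history, window):
    """True if the last `window` + 1 recorded rankings are all identical."""
    if len(rank_history) <= window:
        return False                    # not enough history yet
    recent = rank_history[-(window + 1):]
    return all(r == recent[0] for r in recent)

# Hypothetical predicted ranks of the new solver after each iteration.
print(ranking_converged([3, 2, 2, 2], window=2))
```

The same function works if each history entry is a full ordering of solvers instead of a single rank, since entries are only compared for equality.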


4 Experimental Design 


Given all the previously presented instantiations for Algorithm 1, this section 
outlines our experimental design, including our evaluation framework, used data 
sets, hyper-parameter choices, and implementation details. 


4.1 Evaluation Framework 


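As stated in the Introduction, this work addresses the New-Solver Problem
(cf. Definition 1). As described in Section 3.1, a prediction model $M$ provides
us with an estimated scoring $s_{\hat{a}}$ for the new solver $\hat{a}$.

Algorithm 2: Evaluation Framework

Input: Solvers $\mathcal{A}$, Instances $\mathcal{I}$, Runtimes $r \colon \mathcal{A} \times \mathcal{I} \to [0, \tau]$
Output: Average Ranking Accuracy $O_{acc}$, Average Fraction of Runtime $O_{rt}$

1   $O \leftarrow \emptyset$
2   for $\hat{a} \in \mathcal{A}$ do
3       $\mathcal{A}' \leftarrow \mathcal{A} \setminus \{\hat{a}\}$
4       $(s_{\hat{a}}, R) \leftarrow$ runALAlgorithm($\mathcal{A}'$, $\mathcal{I}$, $r$, $\hat{a}$)    // Refer to Algorithm 1
        // Determine Ranking Accuracy
5       $O_{acc} \leftarrow 0$
6       for $a \in \mathcal{A}'$ do
7           if $(s_k(a) - s_{\hat{a}}) \cdot (\mathrm{par}_2(a) - \mathrm{par}_2(\hat{a})) > 0$ then
8               $O_{acc} \leftarrow O_{acc} + \frac{1}{|\mathcal{A}'|}$
        // Determine Runtime Fraction
9       $r_{\hat{a}} \leftarrow \sum_{e \in \mathcal{I}} r(\hat{a}, e)$
10      $O_{rt} \leftarrow 0$
11      for $e \in \mathcal{I}$ do
12          if $\exists t,\ (e, t) \in R$ then
13              $O_{rt} \leftarrow O_{rt} + \frac{t}{r_{\hat{a}}}$
14      $O \leftarrow O \cup \{(O_{acc}, O_{rt})\}$
15  $(O_{acc}, O_{rt}) \leftarrow$ average($O$)
16  return $(O_{acc}, O_{rt})$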


To evaluate a concrete instantiation of Algorithm 1, i.e., a concrete choice
for all the sub-routines, we perform cross-validation on our set of solvers, as
shown in Algorithm 2. That is, each solver plays the role of the new solver $\hat{a}$
once (Line 2). Note that the new solver in each iteration is excluded from the
set of solvers $\mathcal{A}$ to avoid data leakage (Line 3). After running our active-learning
framework for solver $\hat{a}$ (Line 4), we compute the value of both our
optimization goals, i.e., ranking accuracy and runtime. We define the ranking accuracy
$O_{acc} \in [0,1]$ (higher is better) by the fraction of pairs $(\hat{a}, a)$ for all
$a \in \mathcal{A} \setminus \{\hat{a}\}$ that are decided correctly regarding the ground-truth scoring
$\mathrm{par}_2$ (Lines 5-8). The fraction of runtime that the algorithm needs to arrive at
its conclusion is denoted by $O_{rt} \in [0,1]$ (lower is better). This metric puts the
runtime summed over the sampled instances in relation to the runtime summed
over all instances in the dataset (Lines 9-13). Finally, we compute averages of
the output metrics in Line 15 after we have collected all cross-validation results
in Line 14. Overall, we want to find an approach that maximizes

$$O_\delta := \delta \, O_{acc} + (1 - \delta)(1 - O_{rt}), \tag{2}$$

whereby $\delta \in [0,1]$ allows for linear weighting between the two optimization goals
$O_{acc}$ and $O_{rt}$. Plotting the approaches that maximize $O_\delta$ for all $\delta \in [0,1]$ on
an $O_{rt}$-$O_{acc}$ diagram provides us with a Pareto front of the best approaches for
different optimization-goal weightings.
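The two evaluation metrics and their weighted combination can be sketched for a single cross-validation fold as follows. All function names and numbers are hypothetical; the sketch assumes that lower scores are better for both the predicted and the ground-truth scoring, and that the new solver is compared only against the remaining solvers.

```python
def ranking_accuracy(predicted_score, true_score, known_pred, known_true):
    """Fraction of known solvers placed on the correct side of the new solver."""
    correct = sum(
        1 for a in known_pred
        if (known_pred[a] - predicted_score) * (known_true[a] - true_score) > 0
    )
    return correct / len(known_pred)

def runtime_fraction(measured, all_runtimes):
    """Runtime spent on sampled instances relative to running every instance."""
    return sum(measured.values()) / sum(all_runtimes.values())

def weighted_objective(accuracy, fraction, delta):
    """Linear trade-off between accuracy and saved runtime."""
    return delta * accuracy + (1.0 - delta) * (1.0 - fraction)

# Hypothetical fold: label-based scores and PAR-2 scores of three known solvers.
acc = ranking_accuracy(
    predicted_score=2.0, true_score=350.0,
    known_pred={"a": 1.5, "b": 2.5, "c": 4.0},
    known_true={"a": 100.0, "b": 400.0, "c": 300.0},
)
frac = runtime_fraction({"e1": 10.0}, {"e1": 10.0, "e2": 90.0})
print(acc, frac, weighted_objective(acc, frac, delta=0.5))
```

In this toy fold, solver "c" is predicted to be worse than the new solver although its ground-truth score is better, so two of three pairs are decided correctly.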


4.2 Data 


In our experiments, we work with the dataset of the SAT Competition 2022 
Anniversary Track [2]. The dataset consists of 5355 instances with respective 
runtime data of 28 sequential SAT solvers. We also use a database of 56 instance
features³ from the Global Benchmark Database (GBD) by Iser et al. [17]. They
comprise instance size features and node distribution statistics for several graph 
representations of SAT instances, among others, and are primarily inspired by 
the SATzilla 2012 features described in [38]. All features are numeric and free of 
missing values. We drop 10 out of 56 features because of zero variance. Overall, 
prediction models have access to 46 instance features and 27 runtime features, 
i.e., excluding the current new solver â. 

Additionally, we retrieve instance-family information⁴ to evaluate the
composition of our sampled benchmarks. Instance families comprise instances from the
same application domain, e.g., planning, cryptography, etc., and are a valuable 
tool for analyzing solver performance. 

For hyper-parameter tuning, we randomly sample 10% of the complete set 
of 5355 instances with stratification regarding the instances’ family. All instance 
families that are too small, i.e., 10% of them corresponds to less than one in- 
stance, are put into one meta-family for stratification. This tuning dataset allows 
for a more extensive exploration of the hyper-parameter space. 
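The stratified 10% sample with a shared meta-family for small families could look as follows. This is a sketch under stated assumptions, not the authors' code: the function name, the `"__meta__"` key, the fixed seed, and rounding each stratum's share down are all illustrative choices.

```python
import random
from collections import defaultdict

def stratified_sample(family_of, fraction=0.1, seed=0):
    """Sample `fraction` of instances per family; tiny families share one stratum."""
    groups = defaultdict(list)
    for instance, family in family_of.items():
        groups[family].append(instance)
    strata = defaultdict(list)
    for family, members in groups.items():
        # Families too small to contribute one instance go to a meta-family.
        key = family if len(members) * fraction >= 1 else "__meta__"
        strata[key].extend(members)
    rng = random.Random(seed)
    sample = []
    for members in strata.values():
        count = int(len(members) * fraction)   # assumption: round down
        sample.extend(rng.sample(sorted(members), count))
    return sample

# Hypothetical dataset: two regular families and two tiny ones.
family_of = {f"p{i}": "planning" for i in range(20)}
family_of.update({f"c{i}": "crypto" for i in range(10)})
family_of.update({f"s{i}": "small-a" for i in range(5)})
family_of.update({f"t{i}": "small-b" for i in range(5)})
print(len(stratified_sample(family_of)))
```

Here the two tiny families jointly contribute one instance, whereas sampling them separately would contribute none.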


4.3 Hyper-parameters 


Given Algorithm 1, there are several possible instantiations for the three sub- 
routines, i.e., ranking, selection, and stopping. Also, there are different choices 
for the runtime-label prediction model and runtime discretization. We describe 
these experimental configurations in the following. 


Ranking. Regarding ranking (cf. Section 3.1), we experiment with the following 
approaches and hyper-parameter values: 


– Observed PAR-2 ranking of already sampled instances
– Predicted runtime-label ranking
  • History size: Consider the latest 1, 10, 20, 30, or 40 predictions within a
    voting approach for stability. The latest $x$ predictions for each instance
    vote on the instance’s winning label.
  • Fallback threshold: If the difference of scores between the new solver $\hat{a}$
    and another solver drops below 0.01, 0.05, or 0.1, use the partially
    observed PAR-2 ranking as a tie-breaker.


3 https://benchmark-database.de/getdatabase/base_db
4 https://benchmark-database.de/getdatabase/meta_db

Selection. For selection (cf. Section 3.2), we experiment with the following
methods and hyper-parameter values. Since the potential runtime of experiments
is larger than the model’s update time by orders of magnitude, we only consider
incrementing our benchmark by one instance at a time rather than using batches,
as proposed in recent active-learning work [31,34]. A drawback
of this is the lack of parallel execution of runtime experiments.


– Random sampling
– Uncertainty sampling
  • Fallback threshold: Use random sampling for the first 0%, 5%, 10%,
    15%, or 20% of instances to explore the instance space.
  • Runtime scaling: Whether to normalize uncertainty scores per instance
    by the average runtime of solvers on it or use the absolute values.
– Information-gain sampling
  • Fallback threshold: Use random sampling for the first 0%, 5%, 10%,
    15%, or 20% of instances to explore the instance space.
  • Runtime scaling: Whether to normalize information-gain scores per
    instance by the average runtime of solvers on it or use the absolute values.


Stopping. For stopping decisions (cf. Section 3.3), we experiment with the 
following criteria and hyper-parameter values: 


— Subset-size stopping criterion, using 10% or 20% of instances 
— Ranking stopping criterion 
  • Minimum amount: Sample at least 2%, 8%, 10%, or 12% of instances 
    before applying the criterion. 
  • Convergence duration: Stop if the predicted ranking stays the same for 
    a number of sampled instances equal to 1% or 2% of all instances. 
— Wilcoxon stopping criterion 
  • Minimum amount: Sample at least 2%, 8%, 10%, or 12% of instances 
    before applying the criterion. 
  • Average of p-values to drop below: 5%. 
  • Exponential-moving average: Incorporate previous significance values by 
    using an EMA with β = 0.1 or β = 0.7. 
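
A smoothed significance-based stopping rule might look like the sketch below. It assumes the common EMA update `ema = beta * p + (1 - beta) * ema`, where `beta` weights the newest value; whether the framework weights the newest or the previous value this way is not stated in the text, so treat the update rule, the function name, and the toy p-values as assumptions.

```python
def wilcoxon_stop_index(p_values, beta, threshold=0.05, min_samples=0):
    """Return the first step at which the smoothed p-value drops below
    `threshold`, or None if that never happens.

    p_values: one (average) p-value per sampled instance, in sampling order.
    beta: weight of the newest p-value in the exponential moving average.
    min_samples: do not stop before this many instances were sampled.
    """
    ema = None
    for step, p in enumerate(p_values, start=1):
        # Initialize with the first value, then blend in each new value.
        ema = p if ema is None else beta * p + (1.0 - beta) * ema
        if step >= min_samples and ema < threshold:
            return step
    return None

# Hypothetical sequence of average p-values over the sampling process.
p_values = [0.40, 0.20, 0.04, 0.03, 0.02, 0.01]
print(wilcoxon_stop_index(p_values, beta=0.7))
print(wilcoxon_stop_index(p_values, beta=0.1))
```

Under this update rule, a large `beta` reacts quickly to recent p-values, while a small `beta` keeps the smoothed value high for longer and therefore stops later.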


Prediction model. Our experiments only use one model configuration for 
runtime-label prediction since an exhaustive grid search would be infeasible. In 
preliminary experiments, we compared various model types from scikit-learn [28]. 
In particular, we conducted nested cross-validation, including hyper-parameter 
tuning, and used Matthews Correlation Coefficient [12,22] to assess the perfor- 
mance for predicting runtime labels. Our final choice is a stacking ensemble [36] 
of two prediction models, a quadratic-discriminant analysis [33] and a random 
forest [3]. Both these models can learn non-linear relationships between the in- 
stance features and the runtime labels. Stacking means that another prediction 
model, in our case a simple decision tree, decides which of the two ensemble 
members makes the prediction on which instance. 
Runtime discretization. To define prediction targets, i.e., discrete runtime 
labels, we use hierarchical clustering with k = 3 and a log-single-link criterion, 
which produced the most useful labels in preliminary experiments. We denote 
this adapted solver scoring function with s3. In our chosen hierarchical proce- 
dure, each non-timeout runtime starts in a separate interval. We then gradually 
merge intervals whose single-link logarithmic distance is the smallest until the 
desired number of partitions is reached. Other clustering approaches that we 
tried include hierarchical clustering with mean-, median-, and complete-link 
criteria, as well as k-means and spectral clustering. 
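
The merging procedure described above can be sketched in a few lines for the one-dimensional case. This is an illustrative reimplementation under stated assumptions (positive runtimes, at least k of them, timeouts excluded beforehand); the function name `discretize_runtimes` and the toy runtimes are not from the paper.

```python
import math

def discretize_runtimes(runtimes, k=3):
    """Cluster non-timeout runtimes into k intervals (single-link, log scale).

    Returns a list of clusters, each a sorted list of runtimes.
    """
    # Every runtime starts in its own interval.
    clusters = [[r] for r in sorted(runtimes)]
    while len(clusters) > k:
        # In one dimension, the single-link distance between neighboring
        # clusters is the gap between their closest members.
        gaps = [math.log(clusters[i + 1][0]) - math.log(clusters[i][-1])
                for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        # Merge the two neighbors with the smallest logarithmic gap.
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters

# Toy runtimes in seconds.
runtimes = [0.5, 0.7, 1.0, 40.0, 55.0, 3000.0, 4200.0]
print(discretize_runtimes(runtimes))
```

Each resulting interval would then correspond to one discrete runtime label.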

To obtain useful labels, we need to ensure that discretized labels still discrim- 
inate solvers and align with the actual PAR-2 ranking. We analyzed the ranking 
induced by s3 in preliminary experiments with the SAT Competition 2022 
Anniversary Track [2]. According to a Wilcoxon signed-rank test with α = 0.05, 
87.83% of solver pairs have significantly different scores after discretization, 
only a slight drop compared to 89.95% before discretization. Further, our 
ranking approach correctly decides for almost all (about 97.45%; σ = 3.68%) solver 
pairs which solver is faster. In particular, the Spearman correlation of s3 and 
PAR-2 ranking is about 0.988, which is very close to the optimal value of 1 [6]. 
All these results show that discretized runtimes are suitable for our framework. 
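
To make the pairwise comparison concrete, the following sketch computes PAR-2 scores (penalized average runtime, with each timeout counted as twice the time limit) and the fraction of solver pairs that two scorings order identically. The solver names, runtimes, and discretized scores are made up for illustration, and representing a timeout as `None` is a convention of this sketch only.

```python
from itertools import combinations

def par2(runtimes, timeout):
    """PAR-2 score: mean runtime, counting each timeout as 2 * timeout."""
    penalized = [2 * timeout if r is None else r for r in runtimes]
    return sum(penalized) / len(penalized)

def pairwise_agreement(scores_a, scores_b):
    """Fraction of solver pairs that both scorings order the same way."""
    pairs = list(combinations(sorted(scores_a), 2))
    same = sum(
        (scores_a[s] < scores_a[t]) == (scores_b[s] < scores_b[t])
        for s, t in pairs
    )
    return same / len(pairs)

timeout = 5000.0
runs = {
    "A": [10.0, 200.0, None],
    "B": [12.0, 150.0, 4000.0],
    "C": [900.0, None, None],
}
par2_scores = {solver: par2(r, timeout) for solver, r in runs.items()}
label_scores = {"A": 1.7, "B": 1.3, "C": 2.7}  # hypothetical discretized scores
print(par2_scores)
print(pairwise_agreement(par2_scores, label_scores))
```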


4.4 Implementation Details 


For reproducibility, our source code and data are available on GitHub (cf. 
footnotes in Section 1). Our code is implemented in Python using scikit-learn [28] 
for making predictions and gbd-tools [17] for SAT-instance retrieval. 


5 Evaluation 


In this section, we evaluate our active-learning framework. First, we analyze and 
tune the different sub-routines of our framework on the tuning dataset. Next, 
we evaluate the best configurations with the full dataset. Finally, we analyze the 
importance of different instance families to our framework. 


5.1 Hyper-Parameter Analysis 


Our experiments follow the evaluation framework introduced in Section 4.1. 
Fig. 2 shows the performance of the approaches from Section 4.3 on O_rt-O_acc 
diagrams for the hyper-parameter-tuning dataset. Evaluating a particular 
configuration with Algorithm 2 returns a point (O_rt, O_acc). We do not show 
intermediate results of the active-learning procedure but only the final results 
after stopping. The plotted lines represent the best-performing configurations 
per ranking approach (Fig. 2a), selection approach (Fig. 2b), and stopping crite- 
rion (Fig. 2c). In particular, we show the Pareto front, i.e., of all configurations 
that share a particular value of the plotted hyper-parameter, we take the maxi- 
mum ranking accuracy over all remaining hyper-parameters not displayed in the 
corresponding plot. 
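
As a rough illustration of how such a front can be extracted from a set of evaluated configurations, consider the sketch below. The function name and the sample points are hypothetical; each point is a (runtime fraction, ranking accuracy) pair.

```python
def pareto_front(points):
    """Return the Pareto-optimal (runtime_fraction, accuracy) points.

    A point is kept only if every point with lower or equal runtime
    has strictly lower accuracy.
    """
    front = []
    best_acc = float("-inf")
    # Sort by runtime ascending; for equal runtime, highest accuracy first.
    for rt, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:
            front.append((rt, acc))
            best_acc = acc
    return front

# Hypothetical results of several configurations.
points = [(0.05, 0.80), (0.05, 0.85), (0.10, 0.83), (0.12, 0.90), (0.30, 0.90)]
print(pareto_front(points))
```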
Fig. 2: O_rt-O_acc diagrams comparing different hyper-parameter instantiations of 
our active-learning framework on the hyper-parameter-tuning dataset. The x- 
axis shows the ratio of total solver runtime on the sampled instances relative 
to all instances. The y-axis shows the ranking accuracy (cf. Section 4.1). Each 
line entails the front of Pareto-optimal configurations for the respective hyper- 
parameter instantiation. 




[Figure 3: two scatter plots. (a) Runtime vs. Instances; (b) Runtime vs. Accuracy. x-axes: Fraction of Runtime.] 


Fig. 3: Scatter plot comparing different instantiations of trade-off parameter δ 
for our active-learning framework on the hyper-parameter-tuning dataset. The 
x-axis shows the fraction of runtime O_rt of the sample, while the y-axes show 
the fraction of instances sampled and ranking accuracy, respectively. The color 
indicates the weighting between different optimization goals δ ∈ [0, 1]. The larger 
δ, the more we favor accuracy over runtime. 


Regarding ranking approaches (Fig. 2a), using the predicted s3-induced 
runtime-label ranking consistently outperforms the partially observed PAR-2 
ranking for each possible value of the trade-off parameter δ. This outcome is expected 
since selection decisions are not random. For example, we might sample more 
instances of one family if it benefits discrimination of solvers. While the partially 
observed PAR-2 score is skewed, the prediction model can account for this. 

Regarding the selection approaches (Fig. 2b), uncertainty sampling performs 
best in most cases. However, information-gain sampling is beneficial if runtime is 
strongly favored (small δ; runtime fraction less than 5%). This result aligns with 
our expectations: Information-gain sampling selects instances that maximize the 
expected reduction in entropy. This means we sample instances revealing simi- 
larities between solvers rather than differences, which helps to build a confident 
model quickly. However, the method cannot select helpful instances for distin- 
guishing solvers later. Random sampling performs reasonably well but is out- 
performed by uncertainty sampling in all cases, showing the benefit of actively 
selecting instances based on a prediction model. 

Regarding the stopping criteria (Fig. 2c), the ranking stopping criterion 
performs most consistently. If accuracy is strongly favored (very high δ), the 
Wilcoxon stopping criterion performs better. The subset-size stopping criterion 
performs reasonably well but does not improve beyond a certain accuracy be- 
cause of sampling a fixed subset of instances. 

Fig. 3a shows an interesting consequence of weighting our optimization goals: 
If we, on the one hand, desire to get a rough estimate of a solver’s performance 
Table 2: Performance comparison (on the full dataset) of the best-performing 
active-learning approaches (AL), random sampling of the same runtime fraction 
with 1000 repetitions (Random), and statically selecting the instances most 
frequently sampled by active-learning approaches (Most Freq.). 


(a) Best-performing AL approach for δ ∈ [0.2, 0.7] 

                                   AL   Random   Most Freq. 
Sampled Runtime Fraction (%)     5.41     5.43         5.44 
Sampled Instance Fraction (%)   26.53     5.43        27.75 
Ranking Accuracy (%)            90.48    88.54        81.08 


(b) Best-performing AL approach for δ ∈ (0.7, 0.8] 

                                   AL   Random   Most Freq. 
Sampled Runtime Fraction (%)    10.35    10.37        10.37 
Sampled Instance Fraction (%)    5.24    10.37        36.96 
Ranking Accuracy (%)            92.33    91.61        84.52 


fast (low δ), approaches favor selecting many easy instances. In particular, the 
fraction of sampled instances is larger than the fraction of runtime. By having 
many observations, it is easier to build a model. If we, on the other hand, desire 
to get a good estimate of a solver’s performance in a moderate amount of time 
(high δ), approaches favor selecting few, difficult instances. In particular, the 
fraction of instances is smaller than the fraction of runtime. 

Furthermore, Fig. 3b reveals which values make the most sense for δ. The 
range δ ∈ [0.2, 0.8] corresponds to the points with a runtime fraction 
between 0.03 and 0.22. We consider this region to be most promising, analogous 
to the elbow method in cluster analysis [18]. 


5.2 Full-Dataset Evaluation 


Having selected the most promising hyper-parameters, we run our active-learning 
experiments on the complete Anniversary Track dataset (5355 instances). The 
aforementioned range δ ∈ [0.2, 0.8] only results in two distinct configurations. 
The best-performing approach for δ ∈ [0.2, 0.7] uses the predicted runtime-label 
ranking, information-gain sampling, and ranking stopping criterion. It can 
predict a new solver’s PAR-2 ranking with 90.48% accuracy (O_acc) in only 5.41% 
of the full evaluation time (O_rt). The best-performing approach for δ ∈ (0.7, 0.8] 
uses the predicted runtime-label ranking, uncertainty sampling, and ranking 
stopping criterion. It can predict a new solver’s PAR-2 ranking with 92.33% 
accuracy (O_acc) in only 10.35% of the full evaluation time (O_rt). 
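
To see why a whole range of trade-off values can map to the same configuration, the sketch below scores the two reported configurations with a linear weighting of accuracy and saved runtime. The linear form `delta * acc + (1 - delta) * (1 - rt)` is an assumption made for this illustration (the actual objective is the one defined in Section 4.1), and the function and dictionary names are hypothetical. The runtime and accuracy values are the full-dataset numbers quoted above, whereas the configurations themselves were selected on the tuning dataset.

```python
def weighted_score(o_rt, o_acc, delta):
    """Assumed linear trade-off between accuracy and saved runtime."""
    return delta * o_acc + (1.0 - delta) * (1.0 - o_rt)

# (runtime fraction, ranking accuracy) of the two reported configurations.
configs = {
    "information-gain": (0.0541, 0.9048),
    "uncertainty": (0.1035, 0.9233),
}

def best_config(delta):
    """Name of the configuration with the highest weighted score."""
    return max(configs, key=lambda name: weighted_score(*configs[name], delta))

for delta in (0.2, 0.5, 0.7, 0.8):
    print(delta, best_config(delta))
```

Under this assumed weighting, the cheaper configuration wins up to a delta of roughly 0.73, after which the more accurate one takes over.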

Table 2 shows how both active-learning approaches (column AL) compare 
against two static baselines: Random samples instances until it reaches roughly 
the same fraction of runtime as the AL benchmark sets. We repeat sampling 
1000 times and report average results. Most Freq. uses a static benchmark set 
Fig. 4: Scatter plot showing the importance of different instance families to our 
framework on the full dataset. The x-axis shows the frequency of instance families 
in the dataset. The y-axis shows the average frequency of instance families in 
the samples selected by active learning. The dashed line represents families that 
occur with the same frequency in the dataset and samples. 


consisting of those instances most frequently sampled by our active learning 
approach. In particular, we consider the average sampling frequency over all 
solvers and Pareto-optimal active-learning approaches. 

Both our AL approaches perform better than random sampling. However, 
the performance differences are not significant according to a Wilcoxon 
signed-rank test with α = 0.05 and also depend on the fraction of sampled runtime 
(cf. Fig. 2b). A clear advantage of our approach, though, is that it indicates 
when to stop adding further instances, depending on the trade-off parameter δ. 
While the active-learning results are less strong on the full dataset than on the 
smaller tuning dataset, they still show the benefit of making benchmark selection 
dependent on the solvers to distinguish. 

A static benchmark using the most frequently AL-sampled instances per- 
forms poorly, though, compared to active learning and random sampling. This 
outcome is somewhat expected since the static benchmark does not reflect the 
right balance of instance families: Families whose instances are uniform-randomly 
selected by AL, e.g., for different solvers, appear less often in this benchmark 
than families where some instances are sampled more often than others. 


5.3 Instance-Family Importance 


Selection decisions of our approach also reveal the importance of different 
instance families to our framework. Fig. 4 shows the occurrence of instance 
families within the dataset and the benchmarks created by active learning. We use 
the best-performing configurations for all δ ∈ [0, 1] and examine the selection 
decisions by the active-learning approach on the SAT Competition 2022 Anniver- 
sary Track dataset [2]. While most families appear with the same fraction in the 
dataset and the sampled benchmarks, a few outliers need further discussion. 
Problem instances of the families fpga, quasigroup-completion, and planning are 
especially helpful to our framework in distinguishing solvers. Instances of these 
families are selected over-proportionally in comparison to the full dataset. In 
contrast, instances of the largest family, i.e., hardware-verification, roughly ap- 
pear with the same fraction in the dataset and the sampled benchmarks. Finally, 
instances of the family cryptography are less important in distinguishing solvers 
than their vast weight in the dataset suggests. A possible explanation is that 
these instances are very similar, such that a small fraction of them is sufficient 
to estimate a solver’s performance on all of them. 
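
Over- and under-representation of a family can be quantified as the ratio between its share in the sampled benchmark and its share in the dataset. The sketch below uses made-up counts; only the family names are taken from the text, and the function name is hypothetical.

```python
from collections import Counter

def representation_ratio(dataset_families, sample_families):
    """Ratio of a family's share in the sample to its share in the dataset.

    Values above 1 mean the family is over-represented in the sample.
    """
    data_counts = Counter(dataset_families)
    sample_counts = Counter(sample_families)
    n_data, n_sample = len(dataset_families), len(sample_families)
    return {
        family: (sample_counts[family] / n_sample) / (count / n_data)
        for family, count in data_counts.items()
    }

# Made-up toy data: one family label per instance.
dataset = ["cryptography"] * 6 + ["planning"] * 2 + ["fpga"] * 2
sample = ["cryptography"] * 1 + ["planning"] * 2 + ["fpga"] * 1
ratios = representation_ratio(dataset, sample)
print(ratios)
```

A ratio of 1 corresponds to the dashed line in Fig. 4.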


6 Conclusions and Future Work 


In this work, we have addressed the New-Solver Problem: Given a new solver, 
we want to find its ranking amidst competitors. Our approach provides accu- 
rate ranking predictions while needing significantly less runtime than a complete 
evaluation on a given benchmark set. On data from the SAT Competition 2022 
Anniversary Track, we can determine a new solver’s PAR-2 ranking with about 
92% accuracy while only needing 10% of the full-evaluation time. We have eval- 
uated several ranking algorithms, instance-selection approaches, and stopping 
criteria within our sequential active-learning framework. We also took a brief 
look at which instance families are the most prevalent in selection decisions. 

Future work may compare further sub-routines for ranking, instance selec- 
tion, and stopping. Additionally, one can apply our evaluation framework to 
arbitrary computation-intensive problems, e.g., NP-complete problems other 
than SAT, as all discussed active-learning methods are problem-agnostic. Such 
problems share most of the relevant properties of SAT solving, i.e., there are es- 
tablished instance features, a complete benchmark is expensive, and traditional 
benchmark selection requires expert knowledge. 

From the technical perspective, one could formulate runtime discretization 
as an optimization problem rather than addressing it empirically. Further, a 
major shortcoming of our current approach is the lack of parallelization, selecting 
instances one at a time. Benchmarking on a computing cluster with n cores 
benefits from having batches of n instances. However, bigger batch sizes n impede 
active learning. Also, it is unclear how to synchronize instance selection and 
updates of the prediction model without wasting too much runtime. 


Acknowledgments. This work was supported by the Ministry of Science, Re- 
Abstract. Over the last years, innovative parallel and distributed SAT 
solving techniques were presented that could impressively exploit the 
power of modern hardware and cloud systems. Two approaches were par- 
ticularly successful: (1) search-space splitting in a Divide-and-Conquer 
(D&C) manner and (2) portfolio-based solving. The latter executes differ- 
ent solvers or configurations of solvers in parallel. For quantified Boolean 
formulas (QBFs), the extension of propositional logic with quantifiers, 
there is surprisingly little recent work in this direction compared to SAT. 
In this paper, we present PARAQOOBA, a novel framework for parallel 
and distributed QBF solving which combines D&C parallelization and 
distribution with portfolio-based solving. Our framework is designed in 
such a way that it can be easily extended and arbitrary sequential QBF 
solvers can be integrated out of the box, without any programming effort. 
We show how PARAQOOBA orchestrates the collaboration of different 
solvers for joint problem solving by performing an extensive evaluation 
on benchmarks from QBFEval’22, the most recent QBF competition. 


1 Introduction 


Quantified Boolean formulas (QBFs) extend propositional logic by quantifiers 
over the Boolean variables [2]. As a consequence, the decision problem of QBF 
(QSAT) is PSPACE-complete, which is potentially harder than the NP-complete 
decision problem of propositional logic (SAT). Hence, the quantifiers allow for 
an efficient encoding of many reasoning problems from formal verification, syn- 
thesis, and planning [26] that most likely do not have a compact formulation 
in propositional logic. Over the last decade, considerable progress has been 
made in sequential QBF solving [22,21]. In contrast to SAT, where conflict- 
driven clause learning (CDCL) [19] is the predominant solving paradigm, in 
QBF solving different approaches of orthogonal strength have been presented. 
Besides QCDCL, the QBF variant of CDCL, which is implemented for example 
in the solver DEPQBF [17], clausal abstraction as implemented in the solver 
CAQE [23] and abstraction-refinement based expansion as implemented in the 
solver RAREQS [13] are particularly successful [22,21]. All of these QBF solving 
approaches considerably benefit from preprocessing, i.e., an extra step before 
the actual solving in which certain redundancies of a formula are eliminated in 
a satisfiability-preserving way with the aim to make it easier for the solver [10]. 

Despite the vivid development in sequential QBF solving, only a few approaches 
have been presented for parallel and distributed QBF solving [18]. The most 
recent parallel QBF solvers are HORDEQBF [1], which integrates sequential 
QCDCL-based solvers to obtain a parallel QBF solver, and, more recently, a 
basic implementation of a QBF module based on the parallel SAT solver 
PARACOOBA [6] with DEPQBF as its only backend solver. To the best of our 
knowledge, besides these two approaches no other parallel QBF solver has recently 
been presented. The situation in SAT is different: several very powerful parallel 
and distributed SAT solvers like MALLOB [24], PAINLESS [5], and the 
aforementioned solver PARACOOBA [7] have been released. They show the potential of 
parallel and distributed approaches impressively by solving hard SAT instances, 
for example from multiplier verification [15]. 

In this paper, we present PARAQOOBA, a novel framework for parallel and 
distributed QBF solving that integrates search-space splitting based on the 
Divide-and-Conquer paradigm with portfolio solving. Our framework is built 
on top of the PARACOOBA SAT solving framework and extends its basic non- 
portfolio QBF solving module. PARAQOOBA reuses most of PARACOOBA’s mod- 
ules providing management and distribution of solver tasks. In addition, we im- 
plemented a very generic interface that allows the easy integration of any QBF 
solver binary into our framework. 

Our main contributions are as follows: 


— we present a new flexible framework for parallel and distributed QBF solving 
that combines D&C search-space splitting with portfolio solving; 

— we show how different QBF solvers that are based on different solving ap- 
proaches can be integrated seamlessly into our framework; 

— we provide our framework as open-source project; 

— we perform an extensive evaluation that demonstrates the power of our ap- 
proach on various kinds of benchmarks. 


PARAQOOBA is integrated into PARACOOBA and available on GitHub: 
https://github.com/maximaximal/paracooba 


This paper is structured as follows: First we introduce some preliminaries re- 
quired for the rest of the paper in the following section. We continue with related 
work in section 3. After that, section 4 summarizes concepts of the PARACOOBA 
solver framework used in our work. Then we introduce how we apply Divide- 
and-Conquer to solving QBF in section 5. Having introduced the background, 
we present our portfolio PARAQOOBA module in detail in section 6 and provide 
an extensive evaluation in section 7. Finally, we summarize our findings and 
conclude in section 8. 
2 Preliminaries 


We consider QBFs $Q.\varphi$ in prenex conjunctive normal form (PCNF) where the 
prefix $Q$ is of the form $Q_1 x_1, \ldots, Q_n x_n$ with $Q_i \in \{\forall, \exists\}$. The matrix $\varphi$ is a 
propositional formula over the variables $x_1, \ldots, x_n$ in conjunctive normal form (CNF). 
A formula in CNF is a conjunction ($\land$) of clauses. A clause is a disjunction ($\lor$) 
of literals. A literal is a variable $x$, a negated variable $\lnot x$, or a (possibly negated) 
truth constant $\top$ (true) or $\bot$ (false). For a literal $l$, the expression $\bar{l}$ denotes $x$ 
if $l = \lnot x$ and it denotes $\lnot x$ otherwise. We sometimes write a clause as a set of 
literals and a CNF formula as a set of clauses. Further, it is often convenient to 
partition the quantifier prefix into quantifier blocks, i.e., maximal sets of 
consecutive variables with the same quantifier type. For example, for the QBF 
$\forall x_1 \forall x_2 \exists y_1 \exists y_2.\varphi$ we also write $\forall X \exists Y.\varphi$ with $X = \{x_1, x_2\}$ and $Y = \{y_1, y_2\}$. 
With upper case letters $X, Y, \ldots$ (possibly subscripted), we usually denote sets 
of variables, while with lower case letters $x, y, \ldots$ (also possibly subscripted), we 
denote variables. If $\varphi$ is a CNF formula, then $\varphi_{x \leftarrow t}$ is the CNF formula obtained 
from $\varphi$ by replacing all occurrences of variable $x$ by truth constant $t \in \{\top, \bot\}$. 
Depending on the value of $t$, variable $x$ is either set to true (if $t$ is $\top$) or to false 
(if $t$ is $\bot$). We define the semantics of QBFs as follows: 


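— a QBF $\forall X Q.\varphi$ is true iff both QBFs $\forall X' Q.\varphi_{x \leftarrow \bot}$ and $\forall X' Q.\varphi_{x \leftarrow \top}$ are true 
where $x \in X$ and $X' = X \setminus \{x\}$; 

— a QBF $\exists Y Q.\varphi$ is true iff at least one of $\exists Y' Q.\varphi_{y \leftarrow \bot}$ and $\exists Y' Q.\varphi_{y \leftarrow \top}$ is true 
where $y \in Y$ and $Y' = Y \setminus \{y\}$. 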


Note that we assume that all variables of a QBF are quantified, i.e., we are 
considering closed formulas only. Further, we use standard semantics of con- 
junction, disjunction, negation, and truth constants. For example, the QBF 
$\phi_1 = \forall x \exists y.((x \lor y) \land (\lnot x \lor \lnot y))$ is true, while $\phi_2 = \exists y \forall x.((x \lor y) \land (\lnot x \lor \lnot y))$ is 
false. As we see already by this small example, the semantics impose an ordering 
on the variables w.r.t. the prefix. Given a QBF $Q.\varphi$, we say that $x <_Q y$ iff $x$ 
occurs before $y$ in the prefix. If clear from the context, we write $x < y$. In $\phi_1$, 
we have $x < y$, while in $\phi_2$, we have $y < x$. 
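
The recursive semantics can be checked on the two example formulas with a naive evaluator. This is a minimal sketch for illustration only: it expands every variable and is therefore exponential in the number of variables. The encoding (integer literals, `"A"` for universal and `"E"` for existential quantifiers) and the function name are assumptions of this example.

```python
def evaluate(prefix, clauses, assignment=None):
    """Evaluate a closed QBF in PCNF by recursive expansion.

    prefix: list of (quantifier, variable), quantifier is "A" or "E".
    clauses: list of clauses; a clause is a list of non-zero ints,
             where -v denotes the negation of variable v.
    """
    assignment = assignment or {}
    if not prefix:
        # All variables assigned: every clause needs one true literal.
        return all(
            any(assignment[abs(lit)] == (lit > 0) for lit in clause)
            for clause in clauses
        )
    (quantifier, var), rest = prefix[0], prefix[1:]
    # Evaluate both branches of the outermost variable lazily.
    results = (
        evaluate(rest, clauses, {**assignment, var: value})
        for value in (False, True)
    )
    return all(results) if quantifier == "A" else any(results)

# Variables: x = 1, y = 2; matrix (x or y) and (not x or not y).
matrix = [[1, 2], [-1, -2]]
phi_1 = evaluate([("A", 1), ("E", 2)], matrix)  # forall x exists y
phi_2 = evaluate([("E", 2), ("A", 1)], matrix)  # exists y forall x
print(phi_1, phi_2)
```

Swapping the two quantifiers changes the truth value because, in the second formula, a single value of y has to work for both values of x.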


3 Related Work 


In practical QBF solving, attempts to parallelize and distribute QBF solvers 
have a long history (cf. [18] for a survey). Already more than 20 years back, the 
first distributed QBF solver PQSOLVE [4] was presented, in a time when QCDCL 
had not been invented yet. With the advent of QCDCL, several attempts have 
been made to build parallel QCDCL solvers and implement knowledge-sharing 
mechanisms for learned clauses and cubes. One example of such a solver is 
PAQUBE [16]. Unfortunately, the code of most of the early approaches is not 
available anymore. Following the success of Cube-and-Conquer-based 
search-space splitting, the QBF solver MPIDEPQBF has been presented [14]. While 
MPIDEPQBF does not implement any sophisticated look-ahead mechanisms, 
it could demonstrate that even without knowledge-sharing considerable speedup 
could be achieved. These results serve as motivation for the approach presented 
in this paper. Unfortunately, MPIDEPQBF is implemented in an older version 
of OCaml that does not run on recent systems and relies on now deprecated li- 
braries, making a comparison impossible. As indicated by its name, it is tailored 
around the sequential QBF solver DEPQBF [17]. Another recent MPI-based 
QBF solver is HORDEQBF [1] which implements knowledge sharing for QCDCL 
solvers. It is designed in such a way that it allows the integration of any QCDCL 
solver. To be integrated, a solver must implement a certain interface, 
i.e., programming effort is necessary to add a new solver. To the best of 
our knowledge, it includes the QBF solver DEPQBF only. HORDEQBF does not 
perform search-space splitting, but it is a parallel portfolio solver with clause 
and cube sharing. It diversifies the parallel solver instances by different 
parameter settings. This is different from sequential portfolio solvers as presented 
in [12], which select among different solvers based on some properties of the input 
formula. Overall, a very strong focus on QCDCL-based solvers can be observed 
for parallel QBF solving frameworks. Because of this, many chances for better 
solving performance are missed, as nowadays there are many other solvers of 
orthogonal strength. With PARAQOOBA we provide a simple way of exploiting 
the power of the different solving approaches without any integration effort. 


4 PARACOOBA 


Our novel framework PARAQOOBA (with q in the middle of its name) builds on 
top of the SAT solver PARACOOBA (with c in the middle of its name). In this 
section, we describe the parts of PARACOOBA that are relevant for our 
extension of PARACOOBA to PARAQOOBA in the remainder of this work. 

PARAQOOBA will be made available publicly during the artifact evaluation 
under the MIT license, similar to PARACOOBA [7,6], which is publicly available 
on GitHub, also under the MIT license³. PARACOOBA is a distributed 
Cube-and-Conquer (C&C) solver that implements a proprietary peer-to-peer-based 
load-balancing protocol. In contrast to standard D&C solvers, the splitting of 
the search space can be done either upfront by using a look-ahead solver that 
produces n cubes or online during solving by look-ahead or other heuristics. 
Amongst other information, the cubes are stored in a binary tree, the solve tree. 


Solver module. A solver module manages the sequential solver that is responsible 
for solving a subproblem. Different solver modules have different code-bases, 
but they also generally share common concepts. A solver module implements a 
parser task, which is created directly after the module was initiated and serves 
as its starting point. It parses the input formula in its own worker thread and 
instantiates a solver manager based on the fully parsed formula. The parser task 
also creates the first solver task as the root of the solve tree. 


3 github.com/maximaximal/Paracooba 
Solver Tasks. For PARACOOBA, solver tasks are paths in the solve tree, with a 
parser task being used to generate the tree’s root. Solver tasks are usually started 
as children of other tasks, saving references to their parents, with the root solver 
task being the only exception. A task’s depth in the solve tree represents its 
priority to be worked on: The greater the depth, the more important a task 
is to be solved locally and the less important it is to be offloaded to other 
compute nodes by the broker module. Only tasks that were created locally may 
be distributed. 


Broker module. The broker module handles relations between solver tasks and 
processes their results. While the solver module generates tasks, the broker sched- 
ules them based on their priorities (their depths) and offloads them if a different 
compute node has less load than the current node. A task result is propagated 
upwards across compute nodes; there is no conceptual difference between locally 
and remotely solved tasks. The broker module is generic and does not rely on a 
specific solver module, instead providing the environment a solver module works 
in. It is already provided by PARACOOBA and stays the same for different solver 
modules. 
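
The depth-based priority scheme can be illustrated with a toy task store. This sketch only mirrors the policy described in the text (deep tasks are solved locally first, shallow tasks are offloaded first, and only if the remote node is less loaded); the class and method names are hypothetical and do not correspond to PARACOOBA's actual code.

```python
class TaskQueue:
    """Toy task store ordered by depth in the solve tree."""

    def __init__(self):
        self._tasks = []  # list of (depth, name) tuples

    def add(self, depth, name):
        self._tasks.append((depth, name))

    def pop_local(self):
        """Deepest task first: highest priority for local solving."""
        if not self._tasks:
            return None
        task = max(self._tasks)
        self._tasks.remove(task)
        return task

    def pop_offload(self, local_load, remote_load):
        """Shallowest task, but only if the remote node is less loaded."""
        if not self._tasks or remote_load >= local_load:
            return None
        task = min(self._tasks)
        self._tasks.remove(task)
        return task

queue = TaskQueue()
for depth, name in [(1, "a"), (3, "b"), (2, "c")]:
    queue.add(depth, name)
print(queue.pop_local())                              # deepest task
print(queue.pop_offload(local_load=5, remote_load=2)) # shallowest task
```

Offloading shallow tasks is attractive because they represent larger subproblems, which amortizes the cost of transferring them to another node.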


Cube Sources. For generating concrete subproblems, cube sources provide as- 
sumption literals to leaf solver tasks. A cube source decides whether a given 
solver task should split again, based on the current configuration (mainly the 
splitting depth) and the given formula. Every solver module can implement 
its own cube source, hence there are different kinds of cube sources for differ- 
ent solver modules. On this basis, very flexible mechanisms for the selection of 
splitting variables can be implemented, ranging from a simple count of literal 
occurrences to advanced look-ahead heuristics. 
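
The simplest selection mechanism mentioned above, a count of literal occurrences combined with a splitting-depth limit, could be sketched as follows. The function name, parameters, and tie-breaking rule are assumptions of this example, not PARACOOBA's interface.

```python
from collections import Counter

def pick_split_variable(clauses, depth, max_depth, assigned=()):
    """Return the variable to split on, or None to stop splitting.

    clauses: list of clauses as lists of non-zero ints (-v negates v).
    depth: depth of the current solver task in the solve tree.
    assigned: variables already fixed on the path to this task.
    """
    if depth >= max_depth:
        return None  # configured splitting depth reached
    counts = Counter(
        abs(lit) for clause in clauses for lit in clause
        if abs(lit) not in assigned
    )
    if not counts:
        return None
    # Most frequent variable; ties go to the smallest variable index.
    return min(counts, key=lambda v: (-counts[v], v))

clauses = [[1, 2], [-1, 3], [1, -3], [2, 3]]
print(pick_split_variable(clauses, depth=0, max_depth=2))
print(pick_split_variable(clauses, depth=1, max_depth=2, assigned={1}))
```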


Task Tree. The task tree is built lazily, i.e., only once a leaf is visited, the leaf is 
either expanded into a sub-tree, or solved. We picture such a tree in Figure 1. 
This tree has a depth of 1, because the path from the tree’s root solver task 
to the leaf solver tasks has a length of 1. Once the active cube source stops 
further splits from being carried out, the tree’s maximum depth is reached. The 
worker thread currently executing a task then borrows a solver instance from the 
solver manager’s central store. Each solver instance is created on-the-fly once 
(normally initialized based on the parser task) for each worker thread, which 
can also happen for multiple worker threads in parallel. After a solver instance 
was created, all other tasks solved by the same worker thread use the same solver 
instance. 


Guiding Paths. The cubes that are given to solver instances as assumptions are 
called guiding paths. They are generated from the path to the leaf being solved. 
The solver instance then handles the solving internally, blocking the worker 
thread until either a result is generated or the task is terminated. Results are 
not returned to parents, but instead handled by the broker module, which then 
traverses the solve tree upwards as far as possible, based on the results already in 
the tree. Different kinds of evaluations can be defined on every level using a user- 
defined assessment function. With the result processed by the broker module, 
the solver task then finishes and the worker thread can take on the next task, 
based on the next-highest priority. The broker may delete the solver task after it 
finished processing, if the result was already used somewhere above it in the tree 
and no information from the original solver task structure is required anymore. 
Once the broker module has enough information to solve the root task, the result 
of the formula was computed successfully. 
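
Deriving a guiding path from a root-to-leaf path is straightforward once each tree level is associated with a split variable. The sketch below assumes that one child of a split corresponds to the positive literal and the other to the negative literal; this encoding and the function name are assumptions for illustration.

```python
def guiding_path(split_variables, branch_bits):
    """Turn a root-to-leaf path into a cube of assumption literals.

    split_variables[i] is the variable split on at depth i;
    branch_bits[i] is True for the positive branch, False for the negative.
    """
    return [v if bit else -v for v, bit in zip(split_variables, branch_bits)]

# A leaf at depth 3: variable 4 set to true, 1 to false, 7 to true.
print(guiding_path([4, 1, 7], [True, False, True]))
```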


Solver Handle. A solver handle wraps instances of a given solver. It must be able 
to receive an Assume event, directly followed by a Solve event. While processing 
these events, a correctly working handle must block its calling thread until a 
result is found. Additionally, it must be fully re-entrant after finishing processing, 
so that the next solver task can apply new assumptions. On top of this, a handle 
must also be able to process a Terminate event, stopping the solver and early- 
returning control to its calling thread. Such a termination event may happen 
at any time, as it is generated by other solver tasks. This possibility of random 
terminations was an issue for our extension to PARAQOOBA, as it complicated 
synchronization of all involved threads. 


QBF Solver Module. PARACOOBA already provided a basic QBF solver module 
similar to the approach seen in MPIDEPQBF. It implements a QDIMACS 
parser in a new solver module based on the SAT module. It realizes a simple 
cube source that returns the variable at the nth position in the prefix, with 
n being the current depth of a solver task. The solve tree is built using two 
adapted assessment functions: one for variables quantified $\forall$ (requiring all 
sub-trees to be true), one for $\exists$ (requiring at least one sub-tree to be true). The 
assessment functions also use PARACOOBA’s cancellation-support to terminate 
unneeded siblings after results already satisfy the respective subproblem. As 
backend solver, it exclusively uses DEPQBF that provides an incremental API 
(which no other recent solver provides, to the best of our knowledge). 


Summary. With its already existing tree-based QBF solving module together 
with its support for distributed solving, PARACOOBA provides a stable basis 
for building an advanced parallel QBF solver. While the existing QBF module 
is rather uncompetitive with a few exceptions that indicate its potential, its 
core infrastructure turned out to be very useful to build our novel framework 
PARAQOOBA that offers built-in portfolio support. 


The networking support mentioned above enables combining multiple compute
nodes by giving each peer a connection to the main node. This is achieved
by setting the --known-remote option. With this feature it becomes possible
to easily distribute larger problem instances on a cluster or in the cloud.
5 Architecture of PARAQOOBA: Combining
Divide-and-Conquer Portfolio Solving 


Our framework PARAQOOBA combines Divide-and-Conquer (D&C) search space 
splitting with portfolio solving. The key feature of PARAQOOBA compared to 
PARACOOBA is to allow portfolio solving at different search depths. The idea is 
illustrated in Figure 1. Both approaches are widely used to realize parallel and 
distributed SAT and QBF solvers. The D&C approach has been especially suc- 
cessful for hard combinatorial SAT problems [11] in a variant called Cube-and- 
Conquer (C&C). The C&C approach relies on powerful, but expensive lookahead 
solvers that heuristically decide which variables shall be considered for splitting. 
In its original SAT version, PARACOOBA builds upon this idea [7]. 

For a QBF $Q_1 X\, Q_2 Y\, \mathcal{Q}.\varphi$ with $Q_1 \neq Q_2$ and
$Q_1, Q_2 \in \{\forall, \exists\}$, though, the possible choices for variable
selection are more restricted because of the quantifier prefix. In general,
only variables from the outermost quantifier block $Q_1 X$ may be considered,
because otherwise, the value of the formula might change. Jordan
et al. [14] observed that for QBF following the sequential order of the variables 
in the first quantifier block already leads to improvements compared to the 
sequential implementation of DEPQBF. The already existing QBF solver module 
of PARACOOBA (see section 4) relied on this observation: it traverses the prefix of 
a PCNF and splits each visited leaf into two sub-trees, respecting both universal 
and existential quantifiers, until a pre-defined maximum depth is reached. Hence, 
it re-implements the approach of MPIDEPQBF in PARACOOBA. 

Our framework PARAQOOBA generalizes the previous QBF module of PARACOOBA
in two ways. First, the interface is generalized such that any QBF solver can
be integrated as backend solver without programming effort. Second, it is now
possible to run several solvers in the leaves, as shown in Figure 2 for one
split. Overall, PARAQOOBA realizes the following approach. The search-space
is split according to the variable ordering of the prefix until a given depth.
Once one of the sub-trees of an existentially quantified variable split is found to 
be true, the other sibling is terminated. Only when both siblings return false, 
the whole split returns false. Universal splits work in a dual manner: the result 
is only true if both sub-trees are found to be true and false otherwise. This 
property of QBF enables efficient termination of sub-tasks. 
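
The splitting scheme described above can be illustrated with a minimal
sequential sketch. It is not PARAQOOBA's implementation: the clause encoding,
the function names, and the brute-force leaf_solve (standing in for a backend
solver) are assumptions made for this example. The short-circuiting of any and
all only mimics, in a single thread, the termination of sibling tasks.

```python
def evaluate(clauses, assignment):
    """True if every clause contains a literal satisfied by the assignment."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)


def leaf_solve(prefix, clauses, assignment):
    """Hypothetical stand-in for a backend QBF solver: recursive evaluation."""
    if not prefix:
        return evaluate(clauses, assignment)
    (quantifier, var), rest = prefix[0], prefix[1:]
    branches = (leaf_solve(rest, clauses, {**assignment, var: value})
                for value in (False, True))
    return any(branches) if quantifier == "e" else all(branches)


def split_solve(prefix, clauses, depth, assignment=None):
    """Split on the prefix order until `depth` is used up, then solve leaves."""
    assignment = assignment or {}
    if depth == 0 or not prefix:
        return leaf_solve(prefix, clauses, assignment)
    (quantifier, var), rest = prefix[0], prefix[1:]
    children = (split_solve(rest, clauses, depth - 1, {**assignment, var: value})
                for value in (False, True))
    # Existential split: one true child suffices, the sibling is skipped.
    # Universal split: one false child suffices, the sibling is skipped.
    return any(children) if quantifier == "e" else all(children)


# forall x1 exists x2: (x1 or x2) and (not x1 or not x2)
prefix = [("a", 1), ("e", 2)]
clauses = [[1, 2], [-1, -2]]
print(split_solve(prefix, clauses, depth=1))
```

With depth 0 the sketch solves the root task directly, so the result must not
depend on the chosen depth.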

In PARAQOOBA, we now also parallelize each solver call over several QBF 
solvers with orthogonal strategies. Compared to prior approaches [18], we run 
a portfolio of multiple solvers in the leaves of the solve tree instead of only 
parallelizing its root. Having just one tree leads to several advantages: We are 
more flexible and may also call a preprocessor (e.g. BLOQQER) before each solve 
call. We also only instantiate the tree once, saving memory and enabling early- 
termination of sibling solver tasks. 


6 Implementation 


This section describes the extension of the SAT solver PARACOOBA (for an
overview see section 4) to our QBF solving framework PARAQOOBA. As PARACOOBA
was originally not designed for portfolio support, several modifications
and extensions were necessary. To this end, we first present the new QBF module
of PARAQOOBA followed by a discussion of novel search-space pruning facilities.

[Figure 1: a prefixed formula is divided over several levels (splitting); each
leaf is handled by Solver 1 || Solver 2 (solving).]

Fig. 1: Divide-and-Conquer with arbitrarily many levels of splitting and
sub-formulas on the leaves solved by a portfolio of different sequential solvers


6.1 The PARAQOOBA QBF Module 


We generalized the already existing QBF solver handle to become an abstract 
base class, which now can be either a single solver handle or a portfolio handle. 
The latter unifies multiple handles into one, emulating a blocking and re-entrant 
interface. Once a portfolio handle is initialized, it starts one thread per internally 
wrapped handle. Each such thread implements a small state machine, waiting 
for events on a shared queue. Once the portfolio handle receives an assumption 
(a temporary truth assignment of a variable for one solver call), it is forwarded 
to all internal threads and is worked on by each wrapped solver in parallel. 

If a portfolio handle was terminated before a solve call was issued, the internal 
handles would enter an invalid state. To circumvent this situation, an assumption 
event also directly triggers the internal state machine to continue into the solve 
state. Once the solve request actually arrives, it is just translated to an empty 
event, which, after it finished processing, indicates that a result was computed. 
A termination event is forwarded to the internal solver handles, but is limited 
to only one event per solve cycle. 


[Figure 2: PARAQOOBA and its QBF module with QBF solver task(s); each worker
(up to Worker n) runs Solver 1 || ... || Solver n.]

Fig. 2: The PARAQOOBA framework

The first internal solver handle to compute a result returns and sends a
termination event to all sibling solvers. The result is saved and the portfolio 
handle waits for all internal handles to be ready to receive the next assumption, 
i.e., returning all solvers to a known state. Once every internal handle has reached 
that, the portfolio handle finally returns to its calling thread, forwarding the 
result of the inner handle. Because of thread scheduling and fast solving of trivial 
subproblems, a result can be forwarded even before the other sibling has been 
started, letting the broker module already complete a task before it itself has 
created both child tasks. This effect led to some issues and had to be mitigated
by adding some conditions on a task already being terminated even though it 
did not yet run to completion. Because a task will only be scheduled after the 
initial call to its assessment function, not many such checks were needed. 
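
The behaviour of a portfolio handle described above (first result wins,
siblings are terminated, the handle returns only once all solvers are back in
a known state) can be sketched as follows. This is a simplified illustration
and not the PARAQOOBA code: run_portfolio and the three toy solvers are
hypothetical, Python threads stand in for the internal handles, and a result
of None models an Unknown result that the portfolio ignores.

```python
import queue
import threading
import time


def run_portfolio(solvers, problem):
    """Race all solvers on one problem; return (name, result) of the first
    definite result, or None if no solver produced one."""
    results = queue.Queue()
    terminate = threading.Event()

    def worker(name, solve):
        # Every worker reports exactly once, even if it was terminated.
        results.put((name, solve(problem, terminate)))

    threads = [threading.Thread(target=worker, args=item)
               for item in solvers.items()]
    for thread in threads:
        thread.start()

    winner = None
    for _ in threads:
        name, result = results.get()
        if result is not None:      # None models an Unknown result
            winner = (name, result)
            break
    terminate.set()                 # ask the remaining siblings to stop
    for thread in threads:
        thread.join()               # all solvers are back in a known state
    return winner


def fast_solver(problem, terminate):
    return True


def slow_solver(problem, terminate):
    for _ in range(200):            # polls the termination flag while working
        if terminate.is_set():
            return None
        time.sleep(0.01)
    return False


def crashing_solver(problem, terminate):
    return None                     # e.g. the child process died


print(run_portfolio({"fast": fast_solver, "slow": slow_solver}, problem=None))
```

Blocking on join before returning is what makes the sketch re-entrant: the
next call starts from a state in which no solver of the previous call is still
running.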

As many QBF solvers lack APIs, we have to work with their binaries that 
generally only read QDIMACS files. For this, we use the QUAPI interfacing 
library, that adds well-performing assumption-based reasoning support to generic 
solver binaries [9]. By not relying on specialized modifications of a solver’s source 
code, we are able to plug-in generic third-party solvers, completely composable 
at runtime. Our PARAQOOBA module provides the --quapisolver parameter, 
that either directly specifies the leaf solver to be used, or automatically generates 
a portfolio handle to wrap multiple parallel leaf solvers. Note that our approach 
works for QBFs starting with existential as well as with universal quantification. 

In its standard configuration, PARAQOOBA returns whether a given instance 
is found to be true or false. When enabling trace output using -t, it also supports 
printing the specific solver and the subproblem (including its guiding path) that 
produced a result. Using this machinery, one obtains an environment to experi- 
ment with benchmarks and to see how multiple solvers complement each other 
for the generated sub-formulas. The trace output is also useful when fully ex- 
panding a QBF formula by specifying a tree-depth of -1. While not advised for 
any real formulas, this was a well-received debugging aid for stress-testing new 
features. The opposite to this can also be done, by applying a tree-depth of 0. 
This directly solves the root task, without splitting the formula. This was also 
how the configuration PQ Portfolio with depth 0 (as discussed in the experimen- 
tal evaluation below) was executed. 


6.2 Search-Space Pruning 


Preprocessing in the leaves. We modified the QBF preprocessor BLOQQER to 
allow forwarding output directly into a given solver binary by adding a -p argu- 
ment. Internally, this writes the complete formula with added assumptions into 
the standard input of BLOQQER’s preprocessing pipeline. 

To plug e.g. CAQE into such a processing chain and then into PARAQOOBA,
one may use our QBF solver module's command line option --quapisolver
bloqqer-popen@-p=caqe. Deferring preprocessing until solving the leaves
preserves the original structure of the formula during the split phase. We
discuss the effects of this later in subsection 7.4.
Integer-Split Reduction. In many planning and verification encodings, the
variables of a quantifier block $QX$ are interpreted as bitvectors representing
$m$ nodes of a graph. Assume that $n = |X|$ bits with $m < 2^n$ are used for
modeling the states of the graph. Then $2^n - m$ assignments to $X$ are not
relevant, but as a solver is agnostic of this information, it has to consider
all assignments.

If m is known to the user, PARAQOOBA can be called with the option 
--intsplit (once or multiple times, once for each layer). One integer-split is 
counted as one layer in the task tree, so a tree-depth of two would split another
quantifier into two more tasks for each state encoded in the previous
integer-based split. To provide an example: Setting --intsplit 5 creates 5
child tasks in the task tree, spanning the first $\lceil \log_2 5 \rceil = 3$
Boolean variables from the quantifier prefix. Without an integer-based split,
these 3 variables would have to be expanded over 3 layers in the task tree, each
inner task being split into two child tasks, resulting in $2^3 = 8$ leaves, as
opposed to the 5 from before. Thus, integer-based splits require fewer
intermediate splitting tasks to model the same formula, reducing the work to be
done by the load-balancing mechanism in the Broker module. These integer splits
are efficiently distributed over the network by relying on both the
config-system and an extended QBF cube source.
The cube source always saves the current guiding path, applying new splits, and 
in turn new assumptions, by appending to that path. The cube source itself is 
automatically serialized when a task is chosen to be offloaded to another com- 
pute node. While the possible savings are large, one has to exert great caution 
when using this feature, as it might change the semantics of a formula. 
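
To make the counting argument concrete, the following sketch enumerates the
assumptions of one integer-based split. It is an illustration only: the
function integer_split, the numbering of states from 0 to m - 1, and the bit
order (most significant bit on the first prefix variable) are assumptions of
this example and are not taken from PARAQOOBA.

```python
def integer_split(prefix_vars, m):
    """Return m cubes (lists of literals) over the first ceil(log2(m))
    variables of the outermost quantifier block."""
    n = (m - 1).bit_length()        # equals ceil(log2(m)) for m >= 1
    if n > len(prefix_vars):
        raise ValueError("not enough variables in the quantifier block")
    split_vars = prefix_vars[:n]
    cubes = []
    for value in range(m):          # one child task per relevant state
        cube = [var if (value >> (n - 1 - i)) & 1 else -var
                for i, var in enumerate(split_vars)]
        cubes.append(cube)
    return cubes


cubes = integer_split([1, 2, 3, 4], m=5)
print(len(cubes), "child tasks instead of", 2 ** 3, "binary leaves")
for cube in cubes:
    print(cube)
```

The three assignments that encode the values 5, 6 and 7 are never generated,
which is exactly why the feature may change the semantics of a formula if m
does not match the encoding.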


7 Evaluation 


In this section, we evaluate PARAQOOBA on recent benchmarks and compare it 
to (sequential) state-of-the-art QBF solvers. As sequential backend solvers, we 
use the latest versions of DEPQBF [17] as QCDCL solver, CAQE [23] as clausal- 
abstraction solver, and RAREQS [13] as recursive abstraction refinement solver. 
For preprocessing, we use BLOQQER [3] (version 31). All of these solvers were top- 
ranked in the most recent edition of QBFEval’22 [22]. For our experiments we 
used the benchmarks of the PCNF-track of this competition. The main questions 
we want to answer with our evaluation are as follows: 


— how does the parallel portfolio-leaf approach of PARAQOOBA perform in 
comparison to the individual sequential solvers? 

— how does the parallel portfolio-leaf approach of PARAQOOBA perform in 
comparison to the virtual portfolio solver of the sequential solvers? 

— what is the impact of performing the preprocessing in the leaves instead of
on the original input formula?


We ran our experiments on machines with dual-socket 16 core AMD EPYC
7313 processors with 3.7GHz sustained boost clock speed and 256GB main
memory. Each task was assigned as many physical cores as its setup required,
except for tasks with more than 32 concurrent threads, which were exclusively
assigned a whole node each so as not to be slowed down by other loads. The
effects of over-committing in case of three concurrent portfolio solvers (48
threads running in parallel with only 32 physical cores available) are discussed
below in subsection 7.3.

Fig. 3: Full summary of all solved instances with all different solvers without
preprocessing. While Divide-and-Conquer (Depth 4) solves 33 instances that no
sequential solver solved, it solves 28 fewer instances in total than the
virtual portfolio.

Fig. 4: Full summary of all solved instances with all different solvers with
BLOQQER preprocessing. PQ Portfolio (Depth 4) solves 45 instances no sequential
solver could solve and solves 3 more in total.

Please note that in this evaluation we do not use the networking features 
provided by PARACOOBA, as we focus on applicability to QBF and not on the al- 
ready presented scalability of the networking component (for the details see [3]). 


7.1 Overall Performance Comparison 


In order to exploit our hardware with 32 physical cores and 64 logical cores in the 
best possible way, we mainly focus on a splitting depth of four in the following. 
With this depth, 16 worker threads are generated for each problem and with 
three sequential backend solvers, overall 48 processes are started. We call this 
configuration PQ Portfolio, Depth 4. For understanding the impact of splitting, 
we also consider other depths as well. With PQ Portfolio, Depth 0 we refer to 
the configuration in which splitting is disabled. This configuration is particularly 
interesting, because compared to the virtual best solver (VBS), it reveals the 
overhead introduced by our framework (see also the discussion below). In order to 
show the improvements of PARAQOOBA compared to the QBF module without 
portfolio solving that was already available in PARACOOBA [6], we also included 
the configuration PQ DepQBF, Depth 4. 
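
As a quick check of the thread counts, using only the numbers stated above
(binary splitting to depth four, three backend solvers, 32 physical cores):

```latex
\[
  2^{4} = 16 \ \text{worker threads}, \qquad
  16 \cdot 3 = 48 \ \text{solver processes} \; > \; 32 \ \text{physical cores}.
\]
```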

Figure 3 shows the overall results of our evaluation without preprocessing. 
Both configurations of PARAQOOBA, PQ Portfolio, Depth 0 and PQ Portfolio,
Depth 4, are considerably better than the single sequential solvers as well as
the basic non-portfolio QBF module of PARACOOBA only solving with DEPQBF
(PQ DEPQBF, Depth 4). However, compared to the virtual portfolio, 28 fewer
instances are solved in total (for an explanation see below). On the positive
side, 33 formulas can be solved by our new approach that could not be solved by
any sequential solver. The situation changes when preprocessing is applied (cf.
Figure 4). Now PARAQOOBA in configuration PQ Portfolio Preprocessed Formulas,
Depth 4 solves the most formulas. It even solves more formulas than
the Preprocessed Virtual Portfolio, indicating the potential of our approach.

A detailed analysis is given in Figure 5. By comparing the number of solved 
instances to the solve time of individual (preprocessed) problem instances, we 
see a small average speedup when using PARAQOOBA with depth 4 compared 
to a virtual portfolio solver in Figure 5a. The more trivial instances tend to be 
solved quicker using a sequential solver, while the harder to solve instances tend 
to be solved faster with the Divide-and-Conquer approach of PARAQOOBA. 

Next, we used the preprocessed leaves functionality introduced in subsec- 
tion 6.2. Here PARAQOOBA generates its guiding paths using the original formula 
and applies BLOQQER only in the leaves of the solve tree. In this configuration, 
some problem instances take longer to solve than when preprocessing the full for- 
mula, while others can be solved quicker. We present these results in Figure 5b. 
Such a result was expected, as it is conceptually similar to inprocessing. 
[Figure 5: four scatter plots of solve times in seconds, axis ticks 0.1, 1, 10,
100, 1000; recovered axis label: PQ Portfolio Preprocessed Leaves, Depth 4 [s].
Panels:
(a) PARAQOOBA with Depth 4 compared to Virtual Portfolio
(b) Same as (a), but with BLOQQER preprocessing in leaves
(c) PARAQOOBA: preprocessing of leaf formulas compared to preprocessing of
    input formula
(d) PARAQOOBA with Depth 4 compared to Virtual Portfolio on Hex formulas]

Fig. 5: Detailed comparison of PARAQOOBA against the virtual portfolio of
DEPQBF, CAQE, and RAREQS in (a), (b), (d). In (a), PARAQOOBA solves 45 instances
that no sequential solver could solve. In (b), PARAQOOBA solves 38 instances no
sequential solver could solve, 8 of which also could not be solved with
portfolio over preprocessed formulas as in (a). (d) focuses only on preprocessed
formulas from the Hex benchmark family. In (c), we directly compare
preprocessing in the leaves to preprocessing in the input formula.


PARAQOOBA: Parallel and Distributed QBF Solving 439


[Figure 6: plot over time [s] (0 to 4000), vertical axis from 0 to 200, with
curves for PQ Portfolio, Depth 4; Virtual Portfolio; PQ Portfolio, Depth 0;
CAQE; DEPQBF; HORDEQBF; RAREQS.]

Fig. 6: Preprocessed formulas of the Hex positional game planning [20,25]
benchmarks from the QBF22 benchmark set. Also compared to HORDEQBF [1] as
available state-of-the-art parallel QBF solver.


When considering the formulas that were exclusively solved by PARAQOOBA, 
then the variant with preprocessing the full formula up-front performed best 
followed by the variant with preprocessing in the leaves. These formulas include 
verification and synthesis benchmarks with 2-3 quantifier alternations as well 
as many encodings of the game Hex with 13, 15 or 17 quantifier alternations. 
Table 1 in the appendix lists all instances (48) that were only solved with some 
variant of PARAQOOBA. It also lists which variant was the fastest. 


7.2 Family-Based Analysis 


To understand which formula families benefit most from our Divide-and-Conquer 
solving strategy, we compared the (wall-clock) solve time of PARAQOOBA to the 
virtual portfolio solver. We calculated the speedup by dividing the solve time 
of the sequential solver by the solve time of PARAQOOBA. The instances with 
the highest speedups were some reachability queries (up to 18.09), the Hex game
planning family (17.64), multipliers (16.46), and the formula_add family (15.16).
More detailed results are appended in Table 2. Together with the number of 
Hex instances only PARAQOOBA solved (21), this makes Hex game planning the 
benchmark family with the best overall results in our evaluation. A comparison 
between PARAQOOBA and other solvers is shown in Figure 6. 
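
The speedup used in this analysis can be reproduced with a few lines. The
sketch below is illustrative: the helper names are ours, and the per-solver
times passed to virtual_portfolio_time are made-up values; only the PQ and VPS
times in the rows dictionary are taken from Table 2.

```python
def virtual_portfolio_time(times):
    """Time of the virtual portfolio: the best time among the sequential
    solvers that solved the instance (None marks an unsolved instance)."""
    solved = [t for t in times.values() if t is not None]
    return min(solved) if solved else None


def speedup(vps_time, pq_time):
    """Solve time of the virtual portfolio divided by that of PARAQOOBA."""
    return vps_time / pq_time


# Hypothetical per-solver times in seconds, for illustration only.
print(virtual_portfolio_time({"A": 12.0, "B": None, "C": 3.5}))

# (PQ [s], VPS [s]) pairs as listed in Table 2.
rows = {
    "nreachq_query71_1344n": (2.21, 39.97),
    "mult9.sat": (2.11, 34.73),
    "add5_COMPLETE": (1.78, 26.98),
}
for name, (pq, vps) in rows.items():
    print(name, round(speedup(vps, pq), 2))
```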


7.3 Scalability of our Approach 


As already discussed above, using 16 workers leads to overcommitting cores
when solving with a portfolio of more than two solvers. To quantify this, we did
a scalability experiment with different worker counts. Because the Hex planning
benchmarks had the most predictable performance, we focused this experiment
on these formulas. Figure 7 shows the scalability graph, where the X-axis has
been multiplied by the number of workers used, to visualize the cost of
increased CPU-time compared to reduced wall-clock solve time. The impact of
over-committing CPU cores can be clearly observed in the results of the
portfolio with depth 4. This configuration solves more instances than the
others, but takes longer to solve the first 140 instances, after which the
curves become more similar again.

440 M. Heisinger et al.

[Figure 7: plot over worker-time (wall-clock times worker count) [s], from 0.01
to 100000, vertical axis from 0 to 200. Legend: PQ Portfolio, Depth 0 (164);
PQ Portfolio, Depth 1 (169); PQ Portfolio, Depth 2 (176); PQ Portfolio, Depth 3
(181); PQ Portfolio, Depth 4 (184).]

Fig. 7: Hex Scalability with preprocessed formulas. Depth 4 suffers from
over-committing the available CPU-cores on our hardware and is relatively slow
for the first few problems, but still solves more instances overall.
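
The scaling of the X-axis in Figure 7 can be written down as a one-line
computation. The sketch assumes that a splitting depth d corresponds to 2^d
workers (consistent with the 16 workers at depth 4 mentioned earlier); the
function name and the wall-clock times in the example are made up for
illustration.

```python
def worker_time(wall_clock_seconds, depth):
    """Wall-clock time multiplied by the worker count, assumed to be 2**depth."""
    workers = 2 ** depth
    return wall_clock_seconds * workers


# Hypothetical instance: 10 s without splitting, 2 s with depth 4.
print(worker_time(10.0, depth=0))   # cost with a single worker
print(worker_time(2.0, depth=4))    # cost with 16 workers
```

In this made-up case the wall-clock time drops by a factor of five, while the
worker-time more than triples, which is the trade-off the figure visualizes.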


7.4 Preprocessed Leaves compared to Preprocessed Formulas 


We compared preprocessing the whole formula at once using BLOQQER to calling
BLOQQER using bloqqer-popen in each leaf after first splitting on the unchanged
formula. The first variant modifies the original prefix, including the
quantifier ordering. Because the used splitting algorithm generates guiding
paths by following this quantifier ordering, the different approaches lead to
vastly different results. Figure 5c visualizes these differences by plotting the
solve times of both variants against each other.
Looking at the specific benchmarks benefiting from the two variants, we 
often observed improvements to one variant per family. This strongly suggests 
that adaptive preprocessing and inprocessing techniques could further improve 
solving performance, even without otherwise changing solvers themselves. 


7.5 Lessons Learned 


One would expect that for any given problem, parallel portfolio solvers are as 
fast as the fastest used solver. While this statement is conceptually true, we 
encountered some formulas where PQ-Portfolio gave comparatively bad results, 
while a solver alone could solve the same formula quicker or even instantly. 
We investigated this in more detail and found several segmentation faults in
CAQE and API inconsistencies in DEPQBF that were encountered because of
some corner-case structures of the generated subproblems (e.g., by enforcing the
values of certain variables). We reported these issues to the solver developers
and hope to obtain fixes soon. Having these issues fixed would lead to a more
performant general solution and to a more robust user experience. In sequential
execution of these solvers, we did not encounter any problems on the unmodified 
competition benchmarks without added unit clauses. 

Currently, we adopt the following work-around. Segmentation faults of the 
sequential solvers are handled in our QBF module using the indirection provided 
by QUAPI. Once an unrecoverable error occurs in the solver child process, it 
exits and returns the error up through QUAPI’s factory process and into the 
solver handle. There, such a result is interpreted as Unknown, which is invalid 
and therefore ignored, letting the portfolio wait for other results. We provide all 
affected formulas that we found in the artifact submitted alongside this paper. 

We also observed that calling a solver via its API might lead to a consider- 
ably different behavior than calling a solver from the command line, i.e., different 
optimizations are activated when calling a solver through its API compared to 
using the command-line binary. Such behavior can be mitigated by not using the 
API directly, and instead relying on QUAPI, even if an API would be available. 
This fixes the issues with DEPQBF, which solves some formulas (with assump- 
tions supplied as unit clauses) in under one second if used as a solver binary, 
but not when applying assumptions through its API. We also supply all found 
formulas that triggered this issue in the submitted artifact. 


8 Conclusions 


We presented PARAQOOBA, a parallel and distributed QBF solving framework 
that combines search-space splitting with portfolio solving. We designed the 
framework in such a way that any sequential QBF solver binary can be easily
integrated without any implementation effort. Our experiments demonstrate
that this approach in combination with sequential preprocessing leads to
considerable performance improvements for certain formula families.

With our framework, we provide a stable infrastructure that has the po- 
tential for many future extensions. For example, we did not incorporate any 
advanced splitting heuristics as in modern Cube-and-Conquer solvers. We ex- 
pect that with more advanced heuristics, combined with adaptive but possibly 
non-deterministic re-splitting of leaves, even more speedups could be achieved. 

In addition to the presented experiments, we also evaluated the novel integer- 
split feature (cf. subsection 6.2) with the Hex benchmark family. By providing 
the number of valid game states to PARAQOOBA, we could increase the split- 
ting depth as well as the number of solved instances. We see much potential of 
providing encoding-specific or domain-specific knowledge to the solver and will 
investigate this in future work. 
Data Availability Statement

Data used for benchmarking the described software, including source code, are
made available permanently under a permissive license in a public artifact on
Zenodo. Raw source data for the figures presented in this paper are also
included [8].


A Instances Only Solved by PARAQOOBA 


Name                                               Clauses Variables  QA  Time [s]  Res  Variant
b21_C_3_206                                         242896      3270   3    265.77  ⊤    full
c1_Debug_s3_f1_e1_v1                               1775758    379113   3   3164.34  ⊤    full
c2_Debug_s3_f1_e1_v2                                431970     98425   3   1834.27  ⊤    full
cache-coherence-2-fixpoint-2                         10648      3686   2      0.56  ⊥    leaves
cmu.dme1.B-f3                                         4540      1795   3       0.2  ⊤    leaves
cmu.dme2.B-f3                                         6151      2342   3     818.3  ⊤    leaves
LoginService                                         21667      5289   2   1086.07  ⊥    orig
query64_query42_1344n                                 3423      1426   2     86.73  ⊤    full
hex_compact_goal_witness_based_hein_03_6x6-13.pg      3401      1056  15   2594.27  ⊥    leaves
hex_compact_goal_witness_based_hein_05_6x6-13.pg      3493      1071  15   3102.97  ⊤    full
hex_compact_goal_witness_based_hein_17_6x6-13.pg      3430      1060  15   1919.64  ⊥    full
hex_compact_goal_witness_based_hein_18_7x7-13.pg      4256      1267  15   1401.12  ⊥    full
hex_compact_goal_witness_based_hein_02_5x5-13.pg      3134      1007  15    308.99  ⊤    full
hex_compact_goal_witness_based_hein_15_5x5-15.pg      3667      1195  17   3063.67  ⊤    full
hex_symbolic_explicit_goal_hein_03_6x6-11.pg          3421       902  13    693.11  ⊥    full
hex_symbolic_explicit_goal_hein_05_6x6-11.pg          3611       918  13    501.29  ⊥    full
hex_symbolic_explicit_goal_hein_18_7x7-11.pg          3084      1021  13     447.7  ⊥    leaves
hex_symbolic_explicit_goal_hein_02_5x5-11.pg          2480       739  13    973.33  ⊥    full
hex_symbolic_explicit_goal_hein_16_5x5-11.pg          2376       731  13    301.31  ⊥    full
hex_symbolic_implicit_goal_hein_03_6x6-13.pg          3069      1001  15   1830.57  ⊥    full
hex_symbolic_implicit_goal_hein_17_6x6-13.pg          3097      1005  15   2674.38  ⊥    full
404.36


1944.27 


2050.04 


1005.06 


1572.7 


1102.69 


3123.99 


2489.7 


1852.22 
2693.19 
2897.47 
2469.04 
2169.18 
3054.75 
3489.44 
1782.51 
1609.48 
2055.46 
2253.59 
870.16 

2163.5 

1310.32 
2592.6 

1765.8 

2328.99 
2123.52 
2803.72 


leaves 


leaves 


leaves 


full 
leaves 
leaves 
leaves 
leaves 
leaves 


Table 1: 48 instances that were only solved by a PARAQOOBA configuration.
QA: Quantifier Alternations, Res: Result, Variant: PARAQOOBA configuration
that solved the problem the fastest (preprocess full formula, preprocess leaves,
original formula).
B Instances Solved faster by PARAQOOBA


Name                                                PQ [s]   VPS [s]  Speedup  Res
nreachq_query71_1344n                                 2.21     39.97    18.09  ⊥
hex_witness_based_hein_08_5x5-11.pg                   0.22      3.88    17.64  ⊤
mult9.sat                                             2.11     34.73    16.46  ⊤
add5_COMPLETE                                         1.78     26.98    15.16  ⊤
hex_symbolic_explicit_goal_hein_10_5x5-11.pg         32.23    465.43    14.44  ⊥
hex_compact_goal_witness_based_hein_10_5x5-13.pg    144.98   1853.09    12.78  ⊤
hex_symbolic_explicit_goal_hein_11_5x5-09.pg          1.79     22.53    12.59  ⊥
hex_symbolic_implicit_goal_hein_03_6x6-11.pg         47.52    538.03    11.32  ⊥
reachqu_query60_1344n                                 7.57      77.4    10.22  ⊥
query71_query36_1344n                                11.38    105.83     9.30  ⊥
hex_symbolic_explicit_goal_hein_08_5x5-09.pg          1.18     10.94     9.27  ⊥
hex_symbolic_implicit_goal_hein_20_6x6-11.pg        140.49   1282.38     9.13  ⊥
hex_witness_based_hein_06_4x4-11.pg                   3.41      30.9     9.06  ⊥
hex_compact_goal_witness_based_hein_10_5x5-11.pg     13.97    121.04     8.66  ⊥
hex_symbolic_implicit_goal_hein_19_5x5-11.pg          1.69     14.29     8.46  ⊤
hex_symbolic_implicit_goal_hein_16_5x5-11.pg         22.26    184.75     8.30  ⊥
sortnetsort10.AE.stepl.008                           13.33    107.07     8.03  ⊥
add7_REDUCED                                        135.58   1051.44     7.76  ⊤
reachqu_query64_1344n                                128.4    982.54     7.65  ⊥
hex_compact_goal_witness_based_hein_02_5x5-11.pg     39.04    295.57     7.57  ⊥
amba4b9y.unsat                                        10.9     81.72     7.50  ⊥
hex_symbolic_implicit_goal_hein_15_5x5-13.pg         95.67    714.78     7.47  ⊥
hex_compact_goal_witness_based_hein_15_5x5-13.pg    167.18   1229.74     7.36  ⊥
hex_symbolic_implicit_goal_hein_06_4x4-11.pg          1.32      9.67     7.33  ⊥
hex_compact_goal_witness_based_hein_16_5x5-13.pg    372.26   2713.59     7.29  ⊤


Table 2: Instances that PARAQOOBA (PQ) solved faster compared to a virtual
portfolio solver (VPS) that also solved the same problem, ordered by the relative
speedup and limited to the top 25 entries. Res: Result, Speedup:
$\frac{\text{VPS [s]}}{\text{PQ [s]}}$.
Abstract. Efficiency is a fundamental property of any type of program,
but it is even more so in the context of the programs executing on the 
blockchain (known as smart contracts). This is because optimizing smart 
contracts has direct consequences on reducing the costs of deploying and 
executing the contracts, as there are fees to pay related to their bytes-size 
and to their resource consumption (called gas). Optimizing memory usage 
is considered a challenging problem that, among other things, requires a 
precise inference of the memory locations being accessed. This is also 
the case for the Ethereum Virtual Machine (EVM) bytecode generated 
by the most-widely used compiler, solc, whose rather unconventional 
and low-level memory usage challenges automated reasoning. This paper 
presents a static analysis, developed at the level of the EVM bytecode 
generated by solc, that infers write memory accesses that are needless 
and thus can be safely removed. The application of our implementation on 
more than 19,000 real smart contracts has detected about 6,200 needless 
write accesses in less than 4 hours. Interestingly, many of these writes were 
involved in memory usage patterns generated by solc that can be greatly 
optimized by removing entire blocks of bytecodes. To the best of our 
knowledge, existing optimization tools cannot infer such needless write 
accesses, and hence cannot detect these inefficiencies that affect both the 
deployment and the execution costs of Ethereum smart contracts. 


1 Introduction 


EVM and memory model. Ethereum [27] is considered the world-leading 
programmable blockchain today. It provides a virtual machine, named EVM 
(Ethereum Virtual Machine) [21], to execute the programs that run on the 
blockchain. Such programs, known as Ethereum “smart contracts”, can be writ- 
ten in high-level programming languages such as Solidity [6], Vyper [4], Serpent [3] 
or Bamboo [1] and they are then compiled to EVM bytecode. The EVM bytecode 
is the code finally deployed in the blockchain, and has become a uniform format 
to develop analysis and optimization tools. The memory model of EVM programs 
has been described in previous work [17, 19, 26, 27]. Mainly, there are three 
regions in which data can be stored and accessed: (1) The EVM is a stack-based 
virtual machine, meaning that most instructions perform computations using 
the topmost elements in a machine stack. This memory region can only hold a 
limited amount of values, up to 1024 256-bit words. (2) EVM programs store 
data persistently using a memory region named storage that consists of a map- 
ping of 256-bit addresses to 256-bit words and whose contents persist between 
external function calls. (3) The third memory region is a local volatile memory 
area that we will refer to as EVM memory, and which is the focus of our work. 
This memory area behaves as a simple word-addressed array of bytes that can 
be accessed by byte or as a one-word group. The EVM memory can be used 
to allocate dynamic local data (such as arrays or structs) and also for specific 
EVM bytecode instructions which have been designed to require some lengthy 
operands to be stored in local memory. This is the case of the instructions for 
computing cryptographic hashes, or for passing arguments to and returning data 
from external function calls. Compilers use the stack and volatile memory regions 
in different ways. The most-used Solidity compiler solc generates EVM code 
that uses the stack for storing value-type local variables, as well as intermediate 
values for complex computations and jump addresses, whereas reference-type 
local variables such as array types and user-defined struct types are located in 
memory. For instance, when a Solidity function returns a struct variable, the 
required memory for the struct is allocated and initialized at the beginning of 
the function execution. However, the allocated memory is not always accessed as 
we illustrate in the following function (that belongs to the contract in Fig. 1): 

  function _ownershipAt(uint256 i) private returns (TokenOwnership memory) { 
      return c.unpackedOwnership(_packedOwnerships[i]); 
  } 

Although the execution of _ownershipAt allocates memory for the return value de- 
clared in the function definition, the execution of the function is reserving a differ- 
ent memory space for the actual returned struct obtained from unpackedOwnership 
and, thus, the first reservation and its initialization are needless. The focus of our 
work is on detecting such needless write memory accesses on the code generated 
by solc. Nevertheless, as the analysis works at EVM level, it could be easily 
adapted to EVM code generated by any other compiler. 


Optimization. Optimization of Ethereum smart contracts is a hot research topic, 
see e.g. [9, 10, 12-14, 22, 24] and their references. This is because the reduction of 
their costs is relevant for three reasons: (1) Deployment fees. When the contract 
is deployed on the blockchain, the owner pays a fee related to the size in bytes 
of the bytecode. Hence, a clear optimization criterion is the bytes-size of the 
program. The Solidity compiler solc [6] has as optimization target such bytes-size 
reduction. (2) Gas-metered execution. There is a fee to be paid by each client to 
execute a transaction in the blockchain. This fee is a fixed amount per transaction 
plus the cost of executing all bytecode instructions within the function being 
invoked within the transaction. This cost is measured in “gas” (which is then 
priced in the corresponding cryptocurrency) and this is why the execution is 
said to be gas-metered. The EVM specification ([27] and more recent updates) 
provides a precise gas consumption for each bytecode instruction in the language. 
The goal of most EVM bytecode optimization tools [9, 10, 12-14, 22] is to reduce 
such gas consumption, as this in turn reduces the price of all transactions 
on the smart contract. (3) Enlarging Ethereum’s capability. Due to the huge 
volume of transactions that are being demanded, there is a huge interest in 
enlarging the capability of the Ethereum network to increase the number of 
transactions that can be handled. Optimization of EVM bytecode in general, 
and of its memory usage in particular, is an important step in 
this direction. 


Challenges and contributions. Optimizing memory usage is considered a chal- 
lenging problem that requires a precise inference of the memory locations being 
accessed, and that usually varies according to the memory model of the language 
being analyzed, and to the compiler that generates the code to be executed. 
In the case of Ethereum smart contracts generated by the solc compiler, the 
memory model is rather unconventional and its low-level memory usage patterns 
challenge automated reasoning. On one hand, instead of having an instruction to 
allocate memory, the allocation is performed by a sequence of instructions that 
use the value stored at address 0x40 as the free memory pointer, i.e., a pointer to 
the first memory address available for allocating new memory. In the general case, 
the memory is structured as a sequence of slots: a slot is composed of several 
consecutive memory locations that are accessed in the bytecode from the same 
initial memory location plus a corresponding offset. A slot might just hold a data 
structure created in the smart contract but also, when nested data structures 
are used, from one slot we can find pointers to other memory slots for the nested 
components. Finally, there are other types of transient slots that hold temporary 
data and that need to be captured by a precise memory analysis as well. These 
features pose the main challenges to infer needless write accesses and, to handle 
them accurately, we make the following main contributions: (1) we present a 
slot analysis to (over-)approximate the slots created along the execution and 
the program points at which they are allocated; (2) we then introduce a slot 
usage analysis which infers the accesses to the different slots from the bytecode 
instructions; (3) we finally infer needless write accesses, i.e., program points 
where the memory is written but is never read by any subsequent instruction 
of the program; and (4) we implement the approach and perform a thorough 
experimental evaluation on real smart contracts detecting needless write accesses 
which belong to highly optimizable memory usage patterns generated by solc. 
Finally, it is worth mentioning that the applications of the memory analysis 
(points 1 and 2) go beyond the detection of needless write accesses: a precise 
model of the EVM memory is crucial to enhance the accuracy of any posterior 
analysis (see, e.g., [19] for other concrete applications of a memory analysis). 


2 Memory Layout and Motivating Examples 


Memory Opcodes. The EVM instruction set contains the usual instructions to 
access memory: the most basic instructions that operate on memory are MLOAD 

           struct TokenOwnership { 
             address addr; 
             uint64 startTs; 
             bool burned; 
           } 

           contract Running1 { 

             function unpackedOwnership 
                 (uint256 packed) public 
    s1 s2        returns (TokenOwnership 
                 memory ownership) { 
               ownership.addr = ...; 
               ownership.startTs = ...; 
               ownership.burned = ...; 
             } 
           } 

17         contract Running2 { 
18           Running1 c; 
19           mapping(uint256=>uint256) private _packedOwnerships; 
20           // ... 
21           function _ownershipAt(uint256 i) private 
22  s6           returns (TokenOwnership memory) { 
23  s7         return c.unpackedOwnership(_packedOwnerships[i]); 
24           } 
25           function explicitOwnershipOf(uint256 tokenId) 
26  s3           public returns (TokenOwnership memory) { 
27  s4         TokenOwnership memory ownership; 
28  s5         if (...) { return ownership; } 
29  s8         ownership = _ownershipAt(tokenId); 
30             // ... 
31  s5         return ownership; 
32           } 
33         } 


Fig. 1: Excerpt of smart contract ERC721A. 


and MSTORE, which load and store a 32-byte word from memory, respectively.³ 
The solc compiler generates code to handle memory with a cumulative model 
in which memory is allocated along the execution of the program and is never 
released. In contrast to other bytecode virtual machines, like the Java Virtual 
Machine, the EVM does not have a particular instruction to allocate memory. 
The allocation is performed by a sequence of instructions that use the value 
stored at address 0x40 as the free memory pointer, i.e., a pointer to the first 
memory address available for allocating new memory. In what follows, we use 
mem(x) to refer to the content stored in memory at location x. 


Memory Slots. In the general case, memory is structured as a sequence of slots. 
A slot is composed of consecutive memory locations that are accessed by using 
its initial memory location, which we call the base reference (baseref for short) of 
the slot, plus the corresponding offset needed to access a specific location within 
the slot. Slots usually store (part of) some data structure created in the Solidity 
program (e.g., an array or a struct) and whose length can be known. 


Example 1 (slots). Fig. 1 shows an excerpt of smart contract ERC721A [2] 
which contains two different contracts Running1 and Running2. We have omitted 
non-relevant instructions such as those that appear at lines 15-17 (L15-L17 for 
short). The contract Running1 to the left of Fig. 1 contains the public function 
unpackedOwnership that returns a struct of type TokenOwnership defined at L4- 
L7. The contract Running2, shown to the right, contains the public function 
explicitOwnershipOf that returns, depending on a non-relevant condition, an 
empty struct of type TokenOwnership (L29) or the TokenOwnership received from 
a call to function unpackedOwnership of contract Running1 (L23), which is done in 
the private function _ownershipAt. The execution of function unpackedOwnership 
in Running1 allocates two different memory slots at L13: s1, for the returned 
variable ownership, and s2, which is used for actually returning from the function 
the contents of ownership: 


3 Although the local memory is byte addressable with instruction MSTORE8, to keep the 
description simpler, we only consider the general case of word-addressable MSTORE. 

[Memory layout diagram: reserved locations 0x00, 0x20, 0x40 and 0x60, then slot s1 (L13) with bref=0x80 and slot s2 (L13) with bref=0x80+0x60.] 


The function explicitOwnershipOf in Running2 makes a more intensive use of the 
memory which can be seen in this graphical representation: 
[Memory layout diagram: reserved locations 0x00-0x60 and the slots labelled s3 (L27), s4 (L28), s6 (L22), s7 (L23), s8 (L31) and s5 (L33-L29).] 

The execution of this function might create up to six different slots. At L27 and 
L28, it creates two slots, one for the struct declared in the returns part of the 
function header (s3) and one for the local variable ownership (s4). Depending 
on the evaluation of the condition in the if statement, it might create the slots 
needed to perform the call to _ownershipAt and, consequently, the external call 
to Running1.unpackedOwnership. The invocation of the private function involves 
three slots: one for the struct declared in the returns part of _ownershipAt in 
L22 (s6), one slot to manage the external call data in L23 (s7), and one slot for 
storing the results of the private function _ownershipAt in L31 (s8). Finally, a 
new slot (s5) is created for returning the results of explicitOwnershipOf. This 
new slot might contain the contents of s4 or s8, depending on the if evaluation. 


When an amount of memory t is to be allocated, the slot reservation is made 
by reading the free memory pointer (mem(0x40)) and incrementing it by t positions. 
From this update on, the base reference to the slot just allocated is used, and 
subsequent accesses to the slot are performed by means of this baseref, possibly 
incremented by an offset. 
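
The reservation scheme just described can be sketched in a few lines of Python. This is only an illustrative model, not code from the analysis: the Memory class, the allocate helper and the dictionary-based memory are assumptions made for the example, while the initial pointer value 0x80 and the slot size 0x60 follow the layout shown for s1 and s2 in Example 1.

```python
FREE_PTR = 0x40  # address where solc-generated code keeps the free memory pointer


class Memory:
    """Toy model of EVM memory (assumption: only whole 32-byte words are used)."""

    def __init__(self):
        # The first slot in Example 1 starts at 0x80, so the pointer starts there.
        self.words = {FREE_PTR: 0x80}

    def mload(self, addr):
        return self.words.get(addr, 0)

    def mstore(self, addr, value):
        self.words[addr] = value


def allocate(mem, size):
    """Reserve `size` bytes and return the baseref of the new slot."""
    baseref = mem.mload(FREE_PTR)      # (i) read the free memory pointer
    new_free = baseref + size          # (ii) add the slot size t
    mem.mstore(FREE_PTR, new_free)     # (iii) push the pointer forward
    return baseref


mem = Memory()
s1 = allocate(mem, 0x60)   # a three-word struct such as TokenOwnership
s2 = allocate(mem, 0x60)
mem.mstore(s1 + 0x20, 7)   # later accesses use the baseref plus an offset
```

Since memory is never released in this model, every call to allocate returns a fresh baseref, which mirrors the cumulative memory model used by solc.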


Example 2 (memory slot reservation). The following excerpt of EVM code allocates 
a slot of type TokenOwnership. The EVM bytecode performs three steps: 
(i) load the current value of the free memory pointer mem(0x40) that will be used 
as the baseref of the new slot; (ii) compute the new free memory address by adding 
t to the baseref; and (iii) store the new free memory pointer in mem(0x40). 
Additionally, in the same block of the CFG, the slot reservation is followed by 
the slot initialization at 0x19A, 0x1AB and 0x1B4. 

  0x175: JUMPDEST 
         PUSH1 0x40 
  0x178: MLOAD        // (i) 
         DUP1 
         ... 
         ADD          // (ii) 
  0x17D: PUSH1 0x40 
  0x17F: MSTORE       // (iii) 
         ... 
  0x19A: MSTORE       // baseref+0x00 
         ... 
  0x1AB: MSTORE       // baseref+0x20 
         ... 
  0x1B4: MSTORE       // baseref+0x40 


Solidity reference type values such as arrays, struct typed variables and strings 
are stored in memory using this general pattern, with some minor differences. 
However, there are some cases in which the steps detailed above vary and the 
size of the slot is not known in advance, and thus the free memory pointer cannot 
be updated at this point. For instance, when data is returned by an external call, 
its length is unknown beforehand and hence the free memory pointer is updated 
only after the memory pointed to is written. In other cases, the free memory is 
used as a temporary region with a short lifetime, as in the case of parameter 
passing to external calls, and the free memory pointer is not updated. These 
variants of the general schema must be detected by a precise memory analysis. 
To this end, we consider that a slot is in transient state when its baseref has been 
read from mem(0x40) but the free memory pointer has not been updated, and it 
is in permanent state when the free memory pointer has been pushed forward. 


Example 3 (transient slot). Now we focus on the external call in L23 of Running2, 
which performs a STATICCALL, reading from the stack (see [27] for details) the 
memory location of the input arguments and the location where the results of the 
call will be saved. Interestingly, both locations reuse the same slot (it corresponds 
to s7) as can be seen in the following EVM bytecode from _ownershipAt: 

         PUSH4 0xb04dd20b   // func. selector 
         ... 
         PUSH1 0x40 
  0x114: MLOAD              // baseref transient slot 
         DUP2 
         MSTORE             // stores func. selector 
         PUSH1 0x04 
         ...                // copy func. args. 
         MSTORE             // stores func. args. 
         PUSH1 0x40 
  0x132: MLOAD              // slot baseref 
         ... 
  0x139: STATICCALL         // external call 
         PUSH1 0x40 
  0x151: MLOAD              // slot baseref 
         RETURNDATASIZE 
         ADD                // baseref + data size 
         ... 
  0x15E: PUSH1 0x40 
  0x160: MSTORE             // permanent slot 

The call starts by reading the free memory pointer (at 0x114) and storing at 
that address the arguments’ data (which include the function selector as first 
argument). Importantly, the pointer is not pushed forward when the input 
arguments are written and thus the slot remains in transient state. Once the 
call at 0x139 is executed, the result is written to memory from the baseref on 
(overwriting the locations used for the input arguments) and the slot is finally 
made permanent by reading the free memory pointer again (0x151) and updating 
it (0x160) by adding the actual return data size (RETURNDATASIZE). 
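
The life cycle of this transient slot can be mimicked with a small Python model. It is a sketch under stated assumptions rather than a faithful EVM emulation: the helpers mload, mstore and external_call, the byte-array memory, and the stand-in callee unpacked_ownership (which peeks at the caller's memory only to record the pointer value) are all hypothetical and introduced for illustration.

```python
FREE_PTR = 0x40  # location of the free memory pointer


def mload(mem, addr):
    """Read one 32-byte word."""
    return int.from_bytes(mem[addr:addr + 32], "big")


def mstore(mem, addr, value):
    """Write one 32-byte word."""
    mem[addr:addr + 32] = value.to_bytes(32, "big")


def external_call(mem, selector, arg, callee):
    """Model of the call sequence in Example 3 (hypothetical helper)."""
    base = mload(mem, FREE_PTR)                        # 0x114: baseref, slot is transient
    mem[base:base + 4] = selector.to_bytes(4, "big")   # selector written, pointer not moved
    mstore(mem, base + 4, arg)                         # argument written, pointer not moved
    result = callee(arg)                               # 0x139: STATICCALL
    base = mload(mem, FREE_PTR)                        # 0x151: the same baseref is read again
    for k, word in enumerate(result):                  # result overwrites the input arguments
        mstore(mem, base + 32 * k, word)
    mstore(mem, FREE_PTR, base + 32 * len(result))     # 0x160: slot becomes permanent
    return base


mem = bytearray(0x200)
mstore(mem, FREE_PTR, 0x80)
pointer_during_call = []


def unpacked_ownership(packed):
    # Stand-in for the callee; records the caller's pointer while the slot is transient.
    pointer_during_call.append(mload(mem, FREE_PTR))
    return [0xABCD, 1234, 0]


slot = external_call(mem, 0xB04DD20B, 42, unpacked_ownership)
```

The pointer recorded during the call is still the original baseref, and it only moves forward by the size of the returned data once the call has completed.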

Transient slots are also used when returning data from a public function to 
an external caller. In that case, the EVM code of the public function halts its 
execution using a RETURN instruction. It reads from the stack the memory location 
where the length and the data to be returned are located. However, it does not 
change mem(0x40) because the function code halts its execution at this point, as 
we can see in the EVM code of explicitOwnershipOf (corresponds to slot s5): 

         PUSH1 0x40 
  0x4D:  MLOAD     // ret slot baseref 
         ... 
         MSTORE    // ret.addr (ret+0x00) 
         ... 
         MSTORE    // ret.startTs (ret+0x20) 
         ... 
         MSTORE    // ret.burned (ret+0x40) 
         PUSH1 0x40 
  0x5A:  MLOAD     // ret slot revisit 
         DUP1 
         SWAP2     // baseref of ret plus size 
         SUB       // size of ret data 
  0x5E:  SWAP1 
  0x5F:  RETURN    // ret returned 


The baseref for the return slot is read (at 0x4D) and it is used as a transient slot 
to write the struct contents to be returned by adding the corresponding offset for 
each field contained in the struct (instructions on the left column). The code on 
the left ends with the baseref plus the size of the stored data on top of the stack. 
After that, the baseref is read again (top of the right column) and the length of 
the returned data is computed (by subtracting the baseref from the baseref plus 
the size of the stored data) before calling the RETURN instruction. 


3 Inference of Needless Write Accesses 


This section presents our static inference of needless write accesses. We first 
provide some background in Sec. 3.1 on the type of control-flow-graph (CFG) and 
static analysis we rely upon. Then, the analysis is divided into three consecutive 
steps: (1) the slot analysis, which is introduced in Sec. 3.2, to identify the slots 
created along the execution and the program points at which they are allocated; 
(2) the slot usage analysis, presented in Sec. 3.3, which computes the read and 
write accesses to the different slots identified in the previous step; and (3) the 
detection of needless write accesses, given in Sec. 3.4, which finds those program 
points where there is a write access to a slot which has no read access later on. 


3.1 Context-Sensitive CFG and Flow-Sensitive Static Analysis 


The construction of the CFG of Ethereum smart contracts is a key part of any 
decompiler and static analysis tool and has been the subject of previous research [15, 
16, 25]. The more precise the CFG is, the more accurate our analysis results will 
be. In particular, context-sensitivity [16] on the CFG construction is vital to 
achieve precise results. Our implementation of context-sensitivity is realized by 
cloning the blocks which are reached from different contexts. 

Example 4 (context-sensitive CFG). The EVM code of Running2 creates multiple 
slots for handling structs of type TokenOwnership. Interestingly, all these slots 
are created by means of the same EVM code shown in Ex. 2, which corresponds 
to the CFG block that starts at program point 0x175. As this block is reached 
from different contexts, the context-sensitive CFG contains three clones of this 
block: 0x175, which creates s3 at L27; 0x175_0, which creates s4 used at L28; and 
0x175_1, which reserves s6, created at L22. Block cloning means that program 
points are cloned as well, and we adopt the same subindex notation to refer 
to the program points included in the cloned block: e.g., program point 0x178 
contains the MLOAD 0x40 that gets the baseref of the slot reserved at block 0x175, 
and 0x178_0 refers to the same MLOAD but at block 0x175_0, etc. 
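
The naming convention for clones can be illustrated with a short Python sketch. The BlockCloner class and the context labels ("L27", "L28", "L22") are hypothetical and exist only for this example; the paper does not describe how contexts are represented, only that blocks reached from different contexts are cloned and that program points inherit the clone's subindex.

```python
class BlockCloner:
    """Names the clones of CFG blocks reached from different contexts (hypothetical helper)."""

    def __init__(self):
        self.contexts = {}  # block id -> list of contexts, in order of discovery

    def clone_of(self, block, context):
        seen = self.contexts.setdefault(block, [])
        if context not in seen:
            seen.append(context)
        index = seen.index(context)
        # The first context keeps the original name, later ones get _0, _1, ...
        return block if index == 0 else f"{block}_{index - 1}"

    @staticmethod
    def point_in_clone(pp, clone):
        """Program point pp of the original block, renamed for the given clone."""
        _, _, suffix = clone.partition("_")
        return f"{pp}_{suffix}" if suffix else pp


cloner = BlockCloner()
names = [cloner.clone_of("0x175", ctx) for ctx in ("L27", "L28", "L22", "L28")]
mload_points = [BlockCloner.point_in_clone("0x178", name) for name in names]
```

Reaching the block again from an already known context (the second "L28") reuses the existing clone instead of creating a new one.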

In what follows, we assume that cloning has been made and the memory 
analysis using the resulting CFG (with clones) is thus context-sensitive as well, 
without requiring additional extensions. As usual in standard analyses [23], one 
has to define the notion of abstract state which defines the abstract information 
gathered in the analysis and the transfer function which models the analysis 
output for each possible input. Besides context-sensitivity, the two analyses that 
we will present in the next two sections are flow-sensitive, i.e., they make a 
flow-sensitive traversal of the CFG of the program using as input for analyzing 
each block of the CFG the information inferred for its callers. When the analysis 
reaches a CFG block with new information, we use the operation $\sqcup$ to join the 
two abstract states, and the operator $\sqsubseteq$ to detect that a fixpoint is reached and, 
thus, that the analysis terminates. The operations $\sqcup$ and $\sqsubseteq$, the abstract state, 
and transfer function, will be defined for each particular analysis. 
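
The traversal described above follows the usual worklist scheme for forward data-flow analyses. The following Python sketch shows one generic way to organise it; the function fixpoint, its parameters and the toy CFG are assumptions made for illustration and are not taken from the authors' implementation.

```python
from collections import deque


def fixpoint(cfg, entry, init, transfer, join, leq):
    """Generic forward worklist analysis (illustrative sketch).

    cfg      -- dict: block id -> list of successor block ids
    transfer -- function (block id, input state) -> output state
    join     -- the join operation on abstract states
    leq      -- the ordering used to detect that nothing new was learned
    Returns the input state inferred for every reached block.
    """
    state_in = {entry: init}
    worklist = deque([entry])
    while worklist:
        block = worklist.popleft()
        out = transfer(block, state_in[block])
        for succ in cfg.get(block, []):
            if succ not in state_in:
                state_in[succ] = out
                worklist.append(succ)
            elif not leq(out, state_in[succ]):
                # New information reaches succ: join it and analyse succ again.
                state_in[succ] = join(state_in[succ], out)
                worklist.append(succ)
    return state_in


# Toy instance: states are sets of visited blocks, B has a self-loop.
cfg = {"A": ["B"], "B": ["C", "B"], "C": []}
result = fixpoint(
    cfg, "A", frozenset(),
    transfer=lambda block, state: state | {block},
    join=frozenset.union,
    leq=frozenset.issubset,
)
```

Termination relies on the abstract domain being finite and on join only ever growing the stored states, which is the argument used for both analyses in this section.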


3.2 Slot Analysis 


The slot analysis aims at inferring the abstract slots, which are an abstraction 
of all memory allocations that will be made along the program execution. The 
slots inferred are abstract because over-approximation is made at the level of the 
program points at which slots are allocated. Therefore, an abstract slot might 
represent multiple (not necessarily consecutive) real memory slots, e.g., when 
memory is allocated within a loop. The slot analysis will look for those program 
points at which the value stored in mem(0x40) is read for reserving memory space. 
These program points are relevant in the analysis for two reasons: firstly, to 
obtain the baseref of the memory slot, and, secondly, because from this point on, 
the memory reservation of the corresponding slot has started and it is pending 
to become permanent at some subsequent program point. The output of the 
slot analysis is a set which contains the allocated abstract slots, named $S_{all}$ in 
Def. 2 below. Each allocated abstract slot (i.e., each element in $S_{all}$) is in turn 
a set of program points, as the same abstract slot might have several program 
points where mem(0x40) is read before its reservation becomes permanent. In 
order to obtain $S_{all}$, the memory analysis makes a flow-sensitive traversal of 
the (context-sensitive) CFG of the program that keeps at every program point 
the set of transient slots (i.e., those whose baseref has been read but that have 
not yet been made permanent) and applies the transfer function in Def. 1 to each 
bytecode instruction within the blocks until a fixpoint is reached. An abstract state 
of the analysis is a set $S \subseteq \wp(P_R)$, where $P_R$ is the set of all program points at 
which mem(0x40) is read. The analysis of the program starts with $S = \{\emptyset\}$ at 
all program points and takes $\sqcup$ and $\sqsubseteq$ as the set union and inclusion operations. 
Termination is trivially guaranteed as the number of program points is finite 
and so is $\wp(P_R)$. In what follows, Ins is the set of EVM instructions and, for 
simplicity, we consider MLOAD 0x40 and MSTORE 0x40 as single instructions in Ins. 


Definition 1 (slot analysis transfer function). Given a program point pp 
with an instruction $I \in Ins$, an abstract state $S$, and $K$ = {MSTORE 0x40, RETURN, 
REVERT, STOP, SELFDESTRUCT}, the slot analysis transfer function $\nu$ is defined as 
a mapping $\nu : Ins \times \wp(S) \mapsto \wp(S)$ computed according to the following table: 

       $I$              $\nu(I, S)$ 
  (1)  MLOAD 0x40       $\{ s \cup \{pp\} \mid s \in S \}$ 
  (2)  $I \in K$        $\{\emptyset\}$ 
  (3)  otherwise        $S$ 


Let us explain intuitively how the above transfer function works. As we have 
seen in Sec. 2, in an EVM program all memory reservations start by reading 
mem(0x40) by means of a MLOAD instruction preceded by a PUSH 0x40 instruction 
(case 1 in Def. 1). In this case, the transfer function adds to all sets in S the 
current program point, since this is, in principle, an access to the same slots that 
were already open at this program point and are not permanent yet. To properly 
identify the slots, our analysis also searches for those program points at which 
slots reservations are made permanent (case 2 in Def. 1), i.e., those program 
points with instructions $I \in K$. The most frequently used instruction to make 
a slot reservation permanent is a write access to mem(0x40) using MSTORE, that 
pushes forward the free memory pointer such that any subsequent read access to 
mem(0x40) will allocate a different slot. The rest of instructions in K finalize the 
execution in different forms (a normal return, a forced stop, a revert execution, 
etc.). In all such cases, the slot needs to be considered as a permanent slot so 
that we can reason later on potential needless write accesses involved in it. The 
set S is empty after these instructions since all transient (abstract) slots are 
made permanent after them. We use the notation $S_{pp}$ to refer to the abstract 
state computed at program point pp. 
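
A direct transcription of the three cases of Def. 1 into Python may help to follow the examples below. It is a minimal sketch: the encoding of abstract states as frozensets of frozensets, the string encoding of instructions and the names nu, CLOSING and EMPTY are choices made here for illustration.

```python
CLOSING = {"MSTORE 0x40", "RETURN", "REVERT", "STOP", "SELFDESTRUCT"}  # the set K
EMPTY = frozenset({frozenset()})  # the abstract state containing only the empty set


def nu(instr, pp, state):
    """Slot analysis transfer function of Def. 1.

    `state` is a frozenset of frozensets of program points and `instr` is the
    instruction found at program point `pp`.
    """
    if instr == "MLOAD 0x40":                        # case (1)
        return frozenset(s | {pp} for s in state)
    if instr in CLOSING:                             # case (2)
        return EMPTY
    return state                                     # case (3)


# The reads of mem(0x40) around the STATICCALL of Ex. 3, as straight-line code.
program = [
    (0x114, "MLOAD 0x40"),
    (0x132, "MLOAD 0x40"),
    (0x139, "STATICCALL"),
    (0x151, "MLOAD 0x40"),
    (0x160, "MSTORE 0x40"),
]
state = EMPTY
after = {}
for pp, instr in program:
    state = nu(instr, pp, state)
    after[pp] = state
```

The state reached after 0x151 contains a single transient slot with the three read points, and the MSTORE 0x40 at 0x160 resets the state.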


Example 5 (slot analysis). The slot analysis of Running2 starts with $S_{pp} = \{\emptyset\}$ 
at all program points. When it reaches the block that starts at 0x175 (see 
Ex. 2) $S_{0x175}$ is $\{\emptyset\}$ and it remains empty until 0x178, where the baseref of s3 
is read and hence $S_{0x178} = \{\{0x178\}\}$. This slot is made permanent when the free 
memory pointer is updated at 0x17F, thus having $S_{0x17D} = \{\{0x178\}\}$ and $S_{0x17F} = \{\emptyset\}$. 
Following the same pattern, s4 and s6 are resp. reserved at instructions 0x178_0 
and 0x178_1 and closed at 0x17F_0 and 0x17F_1 (at the cloned blocks). On the 
other hand, the baseref of s5 is read at two consecutive program points (0x4D 
and 0x5A) and updated at 0x5F, and thus, we have $S_{0x4D} = \{\{0x4D\}\}$ and the same 
until $S_{0x5A} = \{\{0x4D, 0x5A\}\}$ and again the same until $S_{0x5F} = \{\emptyset\}$. Finally, after the 
execution of STATICCALL (see Ex. 3) we have three consecutive reads of mem(0x40) 
at 0x114, 0x132 and 0x151 that refer to the same slot s7, which is made permanent 
at 0x160. Therefore, we have $S_{0x15E} = \{\{0x114, 0x132, 0x151\}\}$ and $S_{0x160} = \{\emptyset\}$. 


Using the transfer function, as mentioned in Sec. 3.1, our analysis makes a 
flow-sensitive traversal of the (context-sensitive) CFG of the program that uses 
as input for analyzing each block the information inferred for its callers. When a 
fixpoint is reached, we have an abstract state for each program point that we use 
to compute the set of abstract slots allocated in the program, named $S_{all}$. 


Definition 2. The set of allocated abstract slots $S_{all}$ is defined as 
$S_{all} = \bigcup_{pp \in P_W} S_{pp-1}$, where $P_W$ is the set of all program points pp:I where $I \in K$. 


Example 6 ($S_{all}$ computation). With the values of $S_{0x17F-1}$, $S_{0x17F\_0-1}$, $S_{0x17F\_1-1}$, $S_{0x160-1}$ 
and $S_{0x5F-1}$ from Ex. 5, at the end of the slot analysis of Running2, we have: 

$S_{all} = \{ \underbrace{\{0x178\}}_{s_3}, \underbrace{\{0x178\_0\}}_{s_4}, \underbrace{\{0x178\_1\}}_{s_6}, \underbrace{\{0x114, 0x132, 0x151\}}_{s_7}, \underbrace{\{0x5A, 0x4D\}}_{s_5}, \ldots \}$. 

Note that the cloning of block 0x175 allows our analysis to detect three different 
slots, s3, s4 and s6, for the same program point, 0x178, in the original EVM code. 
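
The computation of Def. 2 can be sketched in Python for the simple case of straight-line code. The helper allocated_slots, the string encoding of program points and the flattening of several code fragments into one instruction list are assumptions of this example; the actual analysis works on the context-sensitive CFG and iterates to a fixpoint.

```python
K = {"MSTORE 0x40", "RETURN", "REVERT", "STOP", "SELFDESTRUCT"}


def allocated_slots(program):
    """Collect the abstract slots of straight-line code.

    `program` is a list of (program point, instruction) pairs. The state that
    reaches each closing instruction (an instruction in K) is added to the result.
    """
    state = {frozenset()}            # no transient slot has been read yet
    s_all = set()
    for pp, instr in program:
        if instr in K:
            s_all |= state           # the state just before the closing instruction
            state = {frozenset()}
        elif instr == "MLOAD 0x40":
            state = {slot | {pp} for slot in state}
    return s_all


program = [
    ("0x178", "MLOAD 0x40"), ("0x17F", "MSTORE 0x40"),       # s3
    ("0x178_0", "MLOAD 0x40"), ("0x17F_0", "MSTORE 0x40"),   # s4
    ("0x4D", "MLOAD 0x40"), ("0x5A", "MLOAD 0x40"), ("0x5F", "RETURN"),  # s5
]
s_all = allocated_slots(program)
```

Each element of the result is the set of program points at which the corresponding abstract slot is read before being closed.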


The next example shows the behavior of the analysis when the program 
contains loops, and an abstraction is needed for approximating the slots. 


Example 7 (loops). Fig. 2 shows the contract Running3 that includes the func- 
tion explicitOwnershipsOf from the smart contract at [2] (made through a 
STATICCALL). This function receives an array of token identifiers as argument 
and returns an array of TokenOwnership structs that is populated invoking the 
function explicitOwnershipOf from Running2 inside a loop. The slots identified 
by the analysis for contract Running3 shown in Fig. 2 are: s9, which is created 
for making a copy of parameter tokenIds to memory; s10, which creates the 
local array ownerships (L44) that contains the array length and pointers to the 
structs identified initially by s11 (and later on by s13); s12 for STATICCALL input 
arguments and return data (L46); s13 which abstracts the structs for storing the 
STATICCALL output results (L46); and s14, which includes the length of ownership 

37           contract Running3 { 
38             Running2 c; 
39             // ... 
40  s9         function explicitOwnershipsOf(uint256[] memory tokenIds) 
41                 public view returns (TokenOwnership[] memory) { 
42               unchecked { 
43                 uint256 tokenIdsLength = tokenIds.length; 
44  s10 s11        TokenOwnership[] memory ownerships = new TokenOwnership[](tokenIdsLength); 
45                 for (uint256 i; i != tokenIdsLength; ++i) 
46  s12 s13          ownerships[i] = c.explicitOwnershipOf(tokenIds[i]); 
47               } 
48  s14          return ownerships; 
49             } 
50           } 

[Memory layout diagram: s9 (L40), s10 (L44), s11 (L44), a sequence of s12 and s13 slots, and s14 (L48).] 


Fig. 2: Solidity code of contract Caller. 


and a copy of s13 for returning the results (L48). The important point is that 
the local array declaration at L44 produces a loop to allocate as many structs 
as elements are contained in the array. For this reason, s11 is an abstract slot 
that represents all TokenOwnership's initially added to the array. Similarly, s12 
and s13 are created inside the for loop, and each abstract slot represents as many 
concrete slots as iterations are performed by the loop. Note that each iteration 
of the loop creates one instance of s12 for getting the results from the call, and it 
is copied later to s13 and pointed to by ownerships (s10). 


As notation, we will use a unique numeric identifier (1, 2, ...) to refer to each 
abstract slot (represented in $S_{all}$ as a set) and retrieve it by means of function 
get_id(a), $a \in S_{all}$. We use $A$ to refer to the set of all such identifiers in the program. 
Also, given a program point pp with an instruction MLOAD 0x40, we define the 
function get_slots(pp) to retrieve the identifiers of the elements of $S_{all}$ that might 
be referenced at pp as follows: get_slots(pp) = $\{ id \mid a \in S_{all} \wedge pp \in a \wedge id = get\_id(a) \}$. 
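
The two helper functions can be sketched in Python as follows. The function number_slots plays the role of get_id; the particular numbering it produces (sorted by program point names) is an arbitrary choice of this example and does not correspond to the names s3, s4, ... used in the paper.

```python
def number_slots(s_all):
    """Assign identifiers 1, 2, ... to the abstract slots (plays the role of get_id)."""
    ordered = sorted(s_all, key=lambda slot: sorted(slot))
    return {slot: ident for ident, slot in enumerate(ordered, start=1)}


def get_slots(pp, ids):
    """Identifiers of the abstract slots that contain program point pp."""
    return {ident for slot, ident in ids.items() if pp in slot}


s_all = {
    frozenset({"0x178"}),
    frozenset({"0x178_0"}),
    frozenset({"0x178_1"}),
    frozenset({"0x114", "0x132", "0x151"}),
    frozenset({"0x4D", "0x5A"}),
}
ids = number_slots(s_all)
```

A program point that never reads mem(0x40) belongs to no abstract slot, so get_slots returns the empty set for it.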


3.3 Slot Access Analysis 


While Sec. 3.2 looked for allocations, the next step of the analysis is the inference 
of the program points at which the inferred abstract slots might be accessed. To 
do so, our slot access analysis needs to propagate the references to the abstract 
slots that are saved at the different positions of the execution stack. Importantly, 
we keep track, not only of the stack positions, but also, in order to abstract 
complex data structures stored in memory (e.g., arrays of structs), we need to 
keep track of the abstract slots that could be saved at memory locations. As seen 
in Ex. 7, a memory location within a slot might contain a pointer to another 
memory location of another slot, as it happens when nested data structures are 
used. Thus, an abstract state is a mapping at which we store the potential slots 
saved at stack positions or at memory locations within other slots. 


Definition 3 (memory analysis abstract state). A memory analysis abstract 
state is a mapping $\pi$ of the form $T \cup A \mapsto \wp(A)$. 


458 E. Albert et al. 


T is the set containing all stack positions, which we represent by natural 
numbers from 0 (bottom of the stack) on, and A is the set of abstract slots 
identifiers computed in Sec. 3.2. We refer to the set of all memory analysis 
abstract states as AS. Note that, for each entry, we keep a set of potential slots 
for each stack position because a block might be reached from several blocks 
with different execution stacks, e.g., in loops or if-then-else structures. In what 
follows, we assume that, given a value k, the map $\pi$ returns the empty set when 
$k \notin dom(\pi)$. The inference is performed by a flow-sensitive analysis (as described 
in Sec. 3.1) that keeps track of the information about the abstract slots used at 
any program point by means of the following transfer function. 


Definition 4 (memory analysis transfer function). Given an instruction 
I with n input operands at program point pp and an abstract state $\pi$, the memory 
analysis transfer function $\tau$ is defined as a mapping $\tau : Ins \times AS \mapsto AS$ of the form: 

       $I$           $\tau(I, \pi)$ 
  (1)  MLOAD 0x40    $\pi[t \mapsto get\_slots(pp)]$ 
  (2)  MLOAD         $\pi[t \mapsto \{ m \mid s \in \pi(t) \wedge m \in \pi(s) \}]$ 
  (3)  MSTORE        $\pi[s \mapsto \pi(s) \cup \pi(t-1)] \setminus \{t, t-1\} \quad \forall s \in \pi(t)$ 
  (4)  SWAPi         $\pi[t \mapsto \pi(t-i),\ t-i \mapsto \pi(t)]$ 
  (5)  DUPi          $\pi[t+1 \mapsto \pi(t-i+1)]$ 
  (6)  otherwise     $\pi \setminus \{ x \mid t-n < x \le t \}$ 

$t = top(pp)$ is the numerical position of the top of the stack before executing I. 


Let us explain the above definition. The transfer function distinguishes between
two different types of MLOAD: (1) accesses to location mem(0x40), which return the
baseref of the slots that might be used, taking them from the previous analysis
through get_slots(pp); and (2) other MLOAD instructions, which could potentially
return slot baserefs from memory locations. For the latter, we have to consider two
possibilities: if we are reading a memory location that stores a generic value
(e.g. a number) then $\pi(t) = \emptyset$; if we are reading a memory location that might
store an abstract slot, then $\pi(t)$ contains all abstract slots that might be stored
at that memory location. Regarding (3), MSTORE has two operands: the operand
at $t$ is the memory address that will be modified by MSTORE, and the operand at
$t-1$ is the value to be stored in that address. For each element $s$ in $\pi(t)$, the
analysis adds the abstract slots that are in $\pi(t-1)$. Other instructions that are
also treated by the analysis are SWAP* and DUP*, shown in (4-5), which exchange or
copy the elements of the stack that take part in the operation. Finally, all other
operations delete the elements of the stack that are no longer used based on the
number of elements taken and written to the stack (case 6).
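The six cases of Def. 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `transfer`, the flag `free_ptr` (telling the sketch that an MLOAD reads mem(0x40)) and the argument `slots_at_pp` (standing in for get_slots(pp)) are hypothetical, and abstract states are plain dictionaries whose missing keys denote the empty set.

```python
from typing import Dict, FrozenSet, Union

Key = Union[int, str]              # stack position (int) or abstract slot id (str)
State = Dict[Key, FrozenSet[str]]
EMPTY: FrozenSet[str] = frozenset()


def transfer(op: str, n: int, t: int, pi: State, i: int = 0,
             free_ptr: bool = False,
             slots_at_pp: FrozenSet[str] = EMPTY) -> State:
    """One step of the transfer function; t is the top of the stack before op."""
    get = lambda k: pi.get(k, EMPTY)   # missing entries denote the empty set
    out = dict(pi)
    if op == "MLOAD" and free_ptr:     # case (1): read of mem(0x40)
        out[t] = slots_at_pp
    elif op == "MLOAD":                # case (2): follow slots stored in slots
        out[t] = frozenset(m for s in get(t) for m in get(s))
    elif op == "MSTORE":               # case (3): address at t, value at t-1
        for s in get(t):
            out[s] = get(s) | get(t - 1)
        out.pop(t, None)
        out.pop(t - 1, None)
    elif op == "SWAP":                 # case (4): exchange t and t-i
        out[t], out[t - i] = get(t - i), get(t)
    elif op == "DUP":                  # case (5): copy position t-i+1 to t+1
        out[t + 1] = get(t - i + 1)
    else:                              # case (6): drop the n consumed operands
        for x in range(t - n + 1, t + 1):
            out.pop(x, None)
    return {k: v for k, v in out.items() if v}   # keep non-empty entries only


# MSTORE with the address at position 10 and the value at position 9
S = lambda *xs: frozenset(xs)
state_in = {2: S("s9"), 6: S("s10"), 8: S("s10"), 9: S("s11"), 10: S("s10")}
state_out = transfer("MSTORE", 2, 10, state_in)
```

Here `state_out` keeps the entries for positions 2, 6 and 8, drops positions 9 and 10, and gains the entry mapping slot s10 to {s11}, i.e., the slot-to-slot information that later MLOAD instructions can follow.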


Example 8 (transfer). Now we focus on the analysis of block 0x175, shown in 
Fig. 3. As we have already explained, this block is responsible for creating the 
memory needed to work with several structs of type TokenOwnership and it is thus 
cloned in the CFG. In particular, we focus on the clone 0x175_1. The analysis 
of the block starts with a stack of size 7 and includes at positions 3 and 4, the 
abstract slots s3 and s4, which were created at L26 and L27 of Fig. 1. At 0x178_1, 
mem(0x40) is read and, by means of get_slots(0x178_1) and considering that
top(0x178_1)=8, we add to $\pi$ a new entry $8 \mapsto s_6$. At 0x179_1, 0x180_1, 0x1AA_1,
0x1B3_1 the transfer function duplicates a slot identifier stored in the stack. MSTORE
and POP instructions of the example remove a slot identifier from the stack. 
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PP Instr T PP int vr 

0x175.1 JUMPDEST {3> s3, 4 s4} 0x19A-1 MSTORE {3 53, 4i 54, 8i s6, 9> s6 } 

0x176.1 PUSHI 0x40 {3 s3, 4> s4 } oor 

0x178_1 MLOAD {3 s3, 4-484, 8> s6 } ox1a91 AND {3 s3, 4 s4, 8 s6, 9 s6} 

0x179.1 DUP1 {3m s3, 41484, 8 s6, 9> s6 } oxiaat DUP2 {3 s3, 4 s4, 8> s6, 9 s6, 11 se} 
oxi7A1 PUSH1 0x60 {3 53, 4i s4, 8> s6, 9 s6} ox1aB1 MSTORE {3 S3, 4 s4, 8i s6, 9 s6} 

0x17C-1 ADD {3 s3, 41484, 8> s6, 9 s6 } a 

0x17D-1 PusHi 0x40 {3 s3, 4i s4, 8-56, 9> s6 } 0x1B2.1 ISZERO {3 s3, 4 s4, 8> s6, 9> s6 } 

Ox17F_1 MSTORE {3 s3, 484, 8> s6 } 0x183.1 DUP? {3 s3, 4 s4, 8> s6, 9 s6, 11 se} 
0x180_1 DUP1 {3m s3, 4484, 8 s6, 9> 56} 0x1B4.1 MSTORE {3 s3, 4i s4, 8> s6, 9> s6 } 

oe oxips.1 Pop {3 s3, 4 s4, 8 s6} 

0x198-1 AND {3 s3, 41484, 8> s6, 9 s6 } oxip61 swaP1 {3 s3, 4 s4, 7T s6} 

0x199_1 DUP2 {3m s3, 41454, 8 s6, 9 s6, Ll se } | [0x187.1 zue. {3 s3, 4584, T s6} 


Fig. 3: Block of the CFG that reserves memory slot for struct 


As it is flow-sensitive, the analysis of each block of the CFG takes as input the
join $\sqcup$ of the abstract states computed with the transfer function for the blocks
that jump to it, and keeps applying the memory analysis transfer function until
a fixpoint is reached. The operation $A \sqcup B$ is the result of joining, by means of
operation $\cup$, all entries from maps $A$ and $B$. Operation $\sqsubseteq$ is defined as expected:
$A \sqsubseteq B$ when $B$ includes entries that are not in $dom(A)$ or when we have an
entry $v \in dom(A) \cap dom(B)$ such that $A(v) \subseteq B(v)$. Again, termination of the
computation is guaranteed because the domain is finite.


Example 9 (joining abstract states). The EVM code of explicitOwnershipOf of
Fig. 1 uses s5 in both return statements at L29 and L33 (see Ex. 1). This EVM
code has a single return block which is reachable from two different paths from
the if statement, and which come with different abstract states: (1) the path
that corresponds to L29 comes with $\pi = \{3 \mapsto s_5\}$, and the other path (L33) with
$\pi = \{3 \mapsto s_4\}$. Our analysis joins both abstract states, resulting in $\pi = \{3 \mapsto \{s_4, s_5\}\}$.
Because of this join, we get that the RETURN instruction that comes from lines
L29 and L33 might return the content of the slots s4 or s5.
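The join used in Example 9 is an entry-wise set union, which can be sketched as follows. The names `join` and `leq` are illustrative; `leq` implements the usual pointwise inclusion order, which is an assumption of this sketch rather than a transcription of the definition given above.

```python
from typing import Dict, FrozenSet, Union

Key = Union[int, str]
State = Dict[Key, FrozenSet[str]]
EMPTY: FrozenSet[str] = frozenset()


def join(a: State, b: State) -> State:
    """Entry-wise union of two abstract states (missing keys are empty sets)."""
    return {k: a.get(k, EMPTY) | b.get(k, EMPTY) for k in set(a) | set(b)}


def leq(a: State, b: State) -> bool:
    """Pointwise inclusion (assumed order): every entry of a is covered by b."""
    return all(v <= b.get(k, EMPTY) for k, v in a.items())


# The two abstract states that reach the single return block in Example 9
path_l29 = {3: frozenset({"s5"})}
path_l33 = {3: frozenset({"s4"})}
joined = join(path_l29, path_l33)   # position 3 may hold s4 or s5
```

A fixpoint loop would re-analyze a block whenever the join of its predecessors' output states is not `leq` the input state used in the previous round.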


When the fixpoint is reached, the analysis has computed an abstract state
for each program point $pp$, denoted by $\pi_{pp}$ in what follows.


Example 10 (complex data structures). The analysis of the code in Fig. 2 shows
how it deals with data structures that might contain pointers to other structures,
e.g. ownerships. The abstract slot that represents variable ownerships is $s_{10}$, which
is written, by means of MSTORE, at two program points, say $pp_1$ and $pp_2$, which, resp.,
come from L44 and L46 of the Solidity code. The input abstract state that reaches
$pp_1$ is $\{2 \mapsto s_9, 6 \mapsto s_{10}, 8 \mapsto s_{10}, 9 \mapsto s_{11}, 10 \mapsto s_{10}\}$, and the transfer function of
MSTORE leaves the abstract state as $\pi_{pp_1} = \{2 \mapsto s_9, 6 \mapsto s_{10}, 8 \mapsto s_{10}, s_{10} \mapsto s_{11}\}$.
At this point, we can see that variable ownerships is initialized with empty
structs and, to represent it, our analysis includes in $\pi$ the entry $s_{10} \mapsto s_{11}$,
as described for instruction MSTORE of the transfer function in Def. 4. The
second write to $s_{10}$ is performed by another MSTORE instruction at $pp_2$. The input
abstract state for $pp_2$ is $\{2 \mapsto s_9, 5 \mapsto s_{10}, 7 \mapsto s_{13}, 8 \mapsto s_{13}, 9 \mapsto s_{10}, s_{10} \mapsto s_{11}\}$,
and thus we get $\pi_{pp_2} = \{2 \mapsto s_9, 5 \mapsto s_{10}, 7 \mapsto s_{13}, s_{10} \mapsto \{s_{11}, s_{13}\}\}$. Interestingly,
at $pp_2$, we detect that $s_{10}$ might also store the structs returned by the call
to c.explicitOwnershipOf(tokenIds[i]), identified by $s_{13}$, which is added to
$s_{10} \mapsto \{s_{11}, s_{13}\}$. Finally, $s_{10}$ is read at the end of the method, returning the set
$\{s_{11}, s_{13}\}$, to copy the content of ownerships to $s_{14}$, the slot used in the return.


3.4 Inference of Needless Write Memory Accesses 


With the results of the previous analysis, we can compute the maps $R$ and $W$,
which are of the form $pp \mapsto \wp(A)$ and capture the slots that might be read
or written, resp., at the different program points. To do so, as multiple EVM
instructions, e.g. RETURN, CALL, LOG, CREATE, ..., might read, or write, memory
locations taking the concrete location from the stack, we define functions $mr(I)$
and $mw(I)$ that, given an EVM instruction $I$, return the position in the stack of
the address to be read and written by $I$, resp. If the instruction does not read/write
any memory position, $mr(I) = \bot$ / $mw(I) = \bot$. For example, $mr(MLOAD) = 0$
as it reads the top of the stack and $mw(MLOAD) = \bot$, or $mr(STATICCALL) = 2$ and
$mw(STATICCALL) = 4$. Now, we define the read/write maps $R$/$W$:


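Definition 5 (memory read/write accesses map). Given an EVM program
$P$, such that $pp = I \in P$ and being $t = top(pp)$, we define maps $R$ and $W$ as follows:

$$R(pp) = \begin{cases} \emptyset & \text{if } mr(I) = \bot \\ \pi_{pp-1}(t - mr(I)) & \text{otherwise} \end{cases} \qquad W(pp) = \begin{cases} \emptyset & \text{if } mw(I) = \bot \\ \pi_{pp-1}(t - mw(I)) & \text{otherwise} \end{cases}$$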


Example 11 (R/W maps). Let us illustrate the computation of R(0x139) and
W(0x139), which contains the STATICCALL of Running2. With the information
obtained from the analysis we have that top(0x139) = 16 and $\pi_{0x138}$ =
$\{3 \mapsto s_3, 4 \mapsto s_4, 7 \mapsto s_6, 10 \mapsto s_7, 12 \mapsto s_7, 14 \mapsto s_7\}$, thus we get R(0x139) = $\{s_7\}$
and W(0x139) = $\{s_7\}$, i.e., the slot used for managing the input and the output
of the external call. Analogously, we get that R(0x178) = $\{s_3\}$ and W(0x178) = $\emptyset$.
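The lookup of Def. 5 for the STATICCALL of Example 11 can be reproduced with a small sketch. The tables `MR` and `MW` contain only the offsets quoted in the text; the helper name `access` and the use of `None` for $\bot$ are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Optional

BOTTOM = None                                   # stands for the "no access" value
MR: Dict[str, Optional[int]] = {"MLOAD": 0, "STATICCALL": 2}
MW: Dict[str, Optional[int]] = {"MLOAD": BOTTOM, "STATICCALL": 4}


def access(table: Dict[str, Optional[int]], op: str, t: int,
           pi_prev: Dict[int, FrozenSet[str]]) -> FrozenSet[str]:
    """Slots possibly accessed by op, given the state before op and the top t."""
    offset = table.get(op, BOTTOM)
    if offset is BOTTOM:
        return frozenset()
    return pi_prev.get(t - offset, frozenset())


S = lambda *xs: frozenset(xs)
pi_0x138 = {3: S("s3"), 4: S("s4"), 7: S("s6"),
            10: S("s7"), 12: S("s7"), 14: S("s7")}
reads = access(MR, "STATICCALL", 16, pi_0x138)    # looks up position 16 - 2
writes = access(MW, "STATICCALL", 16, pi_0x138)   # looks up position 16 - 4
```

Both lookups land on entries holding s7 (positions 14 and 12), which matches the read and write sets reported in Example 11.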


The last step of our analysis consists in searching for write accesses to slots
that will never be read later. To do so, we use the information computed in $R$
and $W$. Given the CFG of the program and two program points $p_1$ and $p_2$, we
define function $reachable(p_1, p_2)$, which returns true when there exists a path in
the CFG from $p_1$ to $p_2$. We define the set of write leaks $\mathcal{N}$ as follows:


Definition 6. Given an EVM program $P$ and its $W$ and $R$, we define $\mathcal{N}$ as
$\mathcal{N} = \{pw{:}s \mid pw \in P \wedge s \in W(pw) \wedge \neg exists\_read(pw, s)\}$
where $exists\_read(pw, s) = \exists\, pr \in dom(R) \mid s \in R(pr) \wedge reachable(pw, pr)$.


Intuitively, the set $\mathcal{N}$ contains those write accesses, taken from $W$, that are
never read by subsequent blocks in the CFG. As both function reachable and the
sets $W$ and $R$ are over-approximations, the computation of $\mathcal{N}$ provides us with those
write accesses that can be safely removed, as the next example shows.
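A minimal sketch of Def. 6 follows, assuming the CFG is given as a successor map over program points and that `reachable` requires a path with at least one edge (the text does not say whether the empty path counts). The function names and the three-node example are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def reachable(cfg: Dict[str, Iterable[str]], src: str, dst: str) -> bool:
    """True if dst can be reached from src through at least one CFG edge."""
    seen: Set[str] = set()
    todo = deque(cfg.get(src, ()))
    while todo:
        p = todo.popleft()
        if p == dst:
            return True
        if p not in seen:
            seen.add(p)
            todo.extend(cfg.get(p, ()))
    return False


def needless_writes(cfg: Dict[str, Iterable[str]],
                    R: Dict[str, Set[str]],
                    W: Dict[str, Set[str]]) -> Set[Tuple[str, str]]:
    """Pairs (pw, s): slot s is written at pw and no reachable point reads s."""
    return {(pw, s)
            for pw, slots in W.items() for s in slots
            if not any(s in R[pr] and reachable(cfg, pw, pr) for pr in R)}


# Hypothetical program: w1 writes s3, w2 writes s5, and only s5 is read later
cfg = {"w1": ["w2"], "w2": ["r"], "r": []}
W = {"w1": {"s3"}, "w2": {"s5"}}
R = {"r": {"s5"}}
leaks = needless_writes(cfg, R, W)
```

In this toy CFG only the write to s3 at `w1` is reported. Because `R` and `reachable` over-approximate, a reported pair is safe to remove, while an unreported one may still be needless.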


Example 12. Our analysis detects that at program points 0x19A, 0x1AB and 0x1B4 
there are MSTORE operations that are never read in the subsequent blocks of 
the CFG. Such operations correspond to the memory initialization of s3, which 
is performed at L27 of the code of Fig. 1 (see Ex. 2). Given that these write 
accesses are the only use of the slot, the whole reservation can be safely removed. 
Moreover, program points 0x19A_1, 0x1AB_1 and 0x1B4_1,
which correspond to the reservation of s6 performed at L22, are also detected as
needless. In essence, it means that s3 and s6 are allocated and initialized but
are never used in the program. Note that all these program points belong to
two cloned blocks (0x175 and 0x175_1). However, the three MSTORE operations of
the other clone of the same block (0x175_0), which correspond to the allocation 
at L28 are not identified as non-read, as they might be used in the return of 
the function. For this, the precision of the context-sensitive CFG is necessary 
to identify these MSTORE operations as needless. As a result we cannot eliminate 
the block because it is needed in one of the clones, but still we can achieve an 
important optimization on the EVM code by removing the unconditional jumps 
to this block in the other two cases that would avoid completely the execution of 
all these instructions (and their corresponding gas consumption [27]). 


The soundness of slots and slots access analyses states that, for each concrete 
slot, there exists an abstract slot in Sau that represents it and, that any access 
to memory is approximated by an inferred abstract slot. Technical details can be 
found in an extended report [8]. 


4 Experimental Evaluation 


This section reports on the results of the experimental evaluation of our approach,
as described in Sec. 3. All components of the analysis are implemented in Python,
are open-source, and can be downloaded from GitHub, where detailed instructions
for installation and usage are provided⁴. We use external components to build
the CFGs (as this is not a contribution of our work). Our analysis tool accepts
smart contracts written in versions of Solidity up to 0.8.17 and bytecode for the
Ethereum Virtual Machine v1.10.25⁵. The experiments have been performed on
an AMD Ryzen Threadripper PRO 3995WX 64-cores and 512 GB of memory,
running Debian 5.10.70. In order to experimentally evaluate the analysis, we
pulled from etherscan.io [5] the Ethereum contracts bound to the last 5,000
open-source verified addresses whose source code was available on July 14, 2022.
For 2.18% of those addresses, the code raises a compilation error from
solc. For the code bound to the 4,891 remaining addresses, the generation of
the CFG (which is not a contribution of this work) times out after 120 s on 626
of them. Removing such failing cases, we have finally analyzed 19,199 smart
contracts, as each address and each Solidity file may contain several contracts.
Note that 84.86% of the contracts are compiled with the solc version 0.8,
presumably with the most advanced compilation techniques. The whole dataset
used can be found at the above GitHub link.

⁴ https://github.com/costa-group/EthIR/tree/memory_optimizer/ethir
⁵ The latest versions released up to Oct 2022.

In order to be in a worst-case scenario for us, we run the memory analysis
after executing the solc optimizer, i.e., we analyze bytecode whose memory
usage may have been optimized already by the optimizer available in solc.
This will also allow us to see if we can achieve further optimization with our
approach. Unfortunately, we have not been able to apply our tool after running
the super-optimizer GASOL [9], because it does not generate the optimized 
bytecode but rather it only reports on the gas and/or size gains for each of 
the blocks. Nevertheless, a detailed comparison of the techniques that GASOL 
applies and ours is given in Sec. 5, where we justify that GASOL will not find 
any of our needless accesses. From the 19,199 analyzed contracts, the analysis 
infers 679,517 abstract memory slots and detects 6,242 needless write memory 
accesses in 12,803s. These needless accesses occur within the code bound to 780 
different addresses, i.e., 15.95% of the analyzed ones. 
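The percentages quoted above can be cross-checked with a few lines of arithmetic. The variable names are illustrative; the only inputs are the figures stated in the text. The check suggests that the 15.95% share is taken over the 4,891 addresses whose code compiles.

```python
addresses = 5000
failed_compilation = round(addresses * 2.18 / 100)   # addresses rejected by solc
compiled = addresses - failed_compilation            # remaining addresses
cfg_timeouts = 626
with_cfg = compiled - cfg_timeouts                   # addresses with a CFG

with_needless = 780
share_of_compiled = 100 * with_needless / compiled   # matches the quoted share
share_of_with_cfg = 100 * with_needless / with_cfg   # share over addresses with a CFG

print(failed_compilation, compiled, with_cfg)
print(round(share_of_compiled, 2), round(share_of_with_cfg, 2))
```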

We have computed the number of needless accesses identified by our analysis 
grouped by function and the number of different contracts that contain these 
functions. Some of them, such as transferFrom (1736 accesses in 439 contracts),
transfer (1745 accesses in 441 contracts), reflectionFromToken (105 accesses in 6
contracts) or withdraw (54 accesses in 32 contracts), are functions widely used in
the implementation of contracts based on ERC tokens. A manual inspection of
the 10 most common public functions with the needless accesses inferred has
revealed two different sources for them: some of the needless accesses are due to
inefficient programming practices, while others are generated by the compiler 
and could be improved. As regards compiler inefficiencies, we detected bytecode 
that allocates memory slots that are inaccessible and cannot be used because the 
baseref to access them is not maintained in the stack. For example, when a struct 
is returned by a function, it always allocates memory for this data. However, 
if the return variable is not named in the header of the function, the compiler 
allocates memory for this data although it will never be accessed. If programmers 
are aware of this behavior they can avoid such generation of useless memory 
but, even better, these memory usage patterns can be changed in the compiler.
For instance, it is reflected in L22 and L27 in Fig. 1, where the functions do 
not name the return variable. Hence, the compiler allocates memory for these 
anonymous data structures which are never used. Similarly, there are various 
situations involving external calls in which the compiler creates memory that is 
never used. When there is an external call that does not retrieve any result, the 
compiler creates two memory slots, one for retrieving the result from the call, 
and another one for copying a potential result to a memory variable that is never 
used. Finally, the compiler also creates memory that is never used for low-level 
plain calls for currency transfer. Even though the contract code does not use 
the second result returned by the low-level call, the compiler generates code for 
retrieving it. All these potential optimizations have been detected by means of 
our inference of needless write accesses and will be communicated to the solc 
developers. 


5 Conclusions and Related Work 


We have proposed a novel memory analysis for Ethereum smart contracts and 
have applied it to infer needless write memory accesses. The application of our 
implementation over more than 19,000 real smart contracts has detected some 
compilation patterns that introduce needless write accesses and that can be easily 
changed in the compiler to generate more efficient code. Let us discuss related
work along two directions: (1) memory analysis and (2) memory optimization.
Regarding (1), we can find advanced points-to analyses developed for Java-like
languages [7, 11, 18, 20]. Focusing on EVM, the static modeling of the EVM
memory in [16] has some similarities with the memory analysis presented in 
Secs. 3.2 and 3.3, since in both cases we are seeking to model the memory 
although with different applications in mind. There are differences, on the one hand,
in the type of static analysis used in both cases: [16] is based on a Datalog
analysis while we have defined a standard transfer function which is used within
a flow-sensitive analysis. More importantly, there are differences in the precision
of both analyses. We can accurately model the memory allocated by nested
data structures in which the memory contains pointers to other memory slots, 
while [16] does not capture such type of accesses. This is fundamental to perform 
memory optimization since, as shown in the running examples of the paper, it 
allows detecting needless write accesses that otherwise would be missed. Finally, 
the application of the memory analysis to optimization is not studied in [16], 
while it is the main focus of our work. 

As regards (2), optimizing memory usage is a challenging research problem
that requires precisely inferring the memory positions that are being accessed.
Such positions sometimes are statically known (e.g., when accessing the EVM
free memory pointer) but, as we have seen, often a precise and complex inference
is required to figure out the slot being accessed at each memory access bytecode.
Recent work within the super-optimizer GASOL [9] is able to perform some
memory optimizations at the level of each block of the CFG (i.e., intra-block).
There are two fundamental differences between our work and GASOL: First,
GASOL can only apply the optimizations when the memory locations being
addressed refer to the same constant address. In other words, there is no real
memory analysis (namely Secs. 3.2 and 3.3). Second, the optimizations are
applied only at an intra-block level and hence many optimization opportunities 
are missed. These two points make a fundamental difference with our approach, 
since detected optimizable patterns (see Sec. 4) require inter-block analysis and 
a precise slot access analysis, and hence cannot be detected by GASOL. 

Finally, as mentioned in Sec. 1, in addition to dynamic memory, smart 
contracts also use a persistent memory called storage. Regarding the application 
of our approach to infer needless accesses in storage, there are two main points. 
First, there is no need to develop a static analysis to detect the slots in storage, as 
they are statically known (hence our inference in Sec. 3.2 and 3.3 is not needed), 
i.e., one can easily know the read and write sets of Def. 6. Thus, the read and 
write sets of our analysis can be easily defined for storage. The second point is 
that, as storage is persistent memory, a write storage access is not removable 
even if there is no further read access within the smart contract, as it needs 
to be stored for a future transaction. The removable write storage accesses are 
only those that are rewritten and not read in-between the two write accesses. 
Including this in our implementation is straightforward. However, this situation 
is rather unusual, and we believe that very few cases would be found and hence 
little optimization can be achieved. 
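The storage criterion stated above (a write is removable only if it is overwritten with no read in between) can be sketched for a single straight-line sequence of accesses. This is a simplification for illustration: the function name, the trace format and the assumption that slots are known constants and that nothing else (e.g. an external call) touches storage along the sequence are all hypothetical.

```python
from typing import Dict, List, Tuple


def overwritten_writes(trace: List[Tuple[str, str]]) -> List[int]:
    """Indices of SSTOREs that are overwritten before the slot is read again."""
    pending: Dict[str, int] = {}   # slot -> index of its last write, not yet read
    dead: List[int] = []
    for idx, (op, slot) in enumerate(trace):
        if op == "SLOAD":
            pending.pop(slot, None)          # the last write is now observed
        elif op == "SSTORE":
            if slot in pending:
                dead.append(pending[slot])   # earlier write was never read
            pending[slot] = idx
    return dead


trace = [("SSTORE", "x"), ("SSTORE", "x"), ("SLOAD", "x"), ("SSTORE", "y")]
removable = overwritten_writes(trace)
```

Only the first write to `x` is reported. The final writes to `x` and `y` stay, since storage persists across transactions and may be read later.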
Abstract. Model checking undiscounted reachability and expected-reward
properties on Markov decision processes (MDPs) is key for the
verification of systems that act under uncertainty. Popular algorithms are 
policy iteration and variants of value iteration; in tool competitions, most 
participants rely on the latter. These algorithms generally need worst-case 
exponential time. However, the problem can equally be formulated as 
a linear program, solvable in polynomial time. In this paper, we give a 
detailed overview of today’s state-of-the-art algorithms for MDP model 
checking with a focus on performance and correctness. We highlight 
their fundamental differences, and describe various optimizations and 
implementation variants. We experimentally compare floating-point and 
exact-arithmetic implementations of all algorithms on three benchmark 
sets using two probabilistic model checkers. Our results show that (op- 
timistic) value iteration is a sensible default, but other algorithms are 
preferable in specific settings. This paper thereby provides a guide for 
MDP verification practitioners—tool builders and users alike. 


1 Introduction 


The verification of MDPs is crucial for the design and evaluation of cyber-physical 
systems with sensor noise, biological and chemical processes, network protocols, 
and many other complex systems. MDPs are the standard model for sequential 
decision making under uncertainty and thus at the heart of reinforcement learning. 
Many dependability evaluation and safety assurance approaches rely in some 
form on the verification of MDPs with respect to temporal logic properties. 
Probabilistic model checking [4,5] provides powerful tools to support this task. 
The essential MDP model checking queries are for the worst-case probability 
that something bad happens (reachability) and the expected resource consumption 
until task completion (expected rewards). These are indefinite (undiscounted) 
horizon queries: They ask about the probability or expectation of a random variable
up until an event (which forms the horizon) but are themselves unbounded.
Many more complex properties internally reduce to solving either reachability or 
expected rewards. For example, if the description of something bad is in linear 
temporal logic (LTL), then a product construction with a suitable automaton 
reduces the LTL query to reachability [6]. This paper sets out to determine the 
practically best algorithms to solve indefinite horizon reachability probabilities 
and expected rewards; our methodology is an empirical evaluation. 


MDP analysis is well studied in many fields and has led to three main types
of algorithms: value iteration (VI), policy iteration (PI), and linear programming 
(LP) [55]. While indefinite horizon queries are natural in a verification context, 
they differ from the standard problem of e.g. operations research, planning, and 
reinforcement learning. In those fields, the primary concern is to compute a 
policy that (often approximately) optimizes the discounted expected reward over 
an infinite horizon where rewards accumulated in the future are weighted by a 
discount factor < 1 that exponentially prefers values accumulated earlier. 


The lack of discounting in verification has vast implications. The Bellman 
operation, essentially describing a one-step backward update on expected re- 
wards, is a contraction with discounting, but not a contraction without. This 
leads to significantly more complex termination criteria for VI-based verification 
approaches [34]. Indeed, VI runs in polynomial time for every fixed discount 
factor [49], and similar results are known for PI as well as LP solving with 
the simplex algorithm [60]. In contrast, VI [9] and PI [20] are known to have 
exponential worst-case behaviour in the undiscounted case. 


So, what is the best algorithm for model checking MDPs? A polynomial-time 
algorithm exists using an LP formulation and barrier methods for its solution [12]. 
LP-based approaches (and their extension to MILPs) are also prominent for 
multi-objective model checking [21], in counterexample generation [23], and 
for the analysis of parametric Markov chains [16]. However, folklore tells us 
that iterative methods, in particular VI, are better for solving MDPs. Indeed, 
variations of VI are the default choice of all model checkers participating in the 
QComp competition [14]. This uniformity may be misleading. Indeed, for some 
stochastic game algorithms, using LP to solve the underlying MDPs may be 
preferable [3, Appendix E.4]. An application in runtime assurance preferred PI
for numerical stability [45, Sect. 6]. A toy example from [34] is a famous challenge 
for VI-based methods. Despite the prominence of LP, the ease of encoding MDPs, 
and the availability of powerful off-the-shelf LP solvers, many tools did (until 
very recently) not include MDP model checking via LP solvers. 


With this paper, we reconsider the PI and LP algorithms to investigate 
whether probabilistic model checking focused on the wrong family of algorithms. 
We report the results of an extensive empirical study with two independent 
implementations in the model checkers Storm [42] and mcsta [37]. We find that, 
in terms of performance and scalability, optimistic value iteration [40] is a solid 
choice on the standard benchmark collection (which goes beyond competition 
benchmarks) but can be beaten quite considerably on challenging cases. We also
emphasize the question of precision and soundness. Numerical algorithms, in
particular ones that converge in the limit, are prone to delivering wrong results. 
For VI, the recognition of this problem has led to a series of improvements over 
the last decade [8,34,40,19,54,56]. We show that PI faces a similar problem. When 
using floating-point arithmetic, additional issues may arise [36,59]. Our use of 
various LP solvers exhibits concerning results for a variety of benchmarks. We 
therefore also include results for exact computation using rational arithmetic. 


Limitations of this study. A thorough experimental study of algorithms requires 
a carefully scoped evaluation. We work with flat representations of MDPs that 
fit completely into memory (i.e. we ignore the state space exploration process 
and symbolic methods). We selected algorithms that are tailored to converge to 
the optimal value. We also exclude approaches that incrementally build and solve 
(partial or abstract) MDPs using simulation or model checking results to guide 
exploration: they are an orthogonal improvement and would equally profit from 
faster algorithms to solve the partial MDPs. Moreover, this study is on algorithms, 
not on their implementations. To reduce the impact of potential implementation 
flaws, we use two independent tools where possible. Our experiments ran on a 
single type of machine—we do not study the effect of different hardware. 


Contributions. This paper contributes a thorough overview on how to model- 
check indefinite horizon properties on MDPs, making MDP model checking more 
accessible, but also pushing the state-of-the-art by clarifying open questions. Our 
study is built upon a thorough empirical evaluation using two independent code 
bases, sources benchmarks from the standard benchmark suite and recent publi- 
cations, compares 10 LP solvers, and studies the influence of various prominent 
preprocessing techniques. The paper provides new insights and reviews folklore 
statements: Particular highlights are a new simple but challenging MDP family 
that leads to wrong results on all floating-point LP solvers (Section 2.3), a nega- 
tive result regarding the soundness of PI with epsilon-precise policy evaluators 
(Section 4), and an evaluation on numerically challenging benchmarks that shows 
the limitations of value iteration in a practical setting (Section 5.3). 


2 Background 


We recall MDPs with reachability and reward objectives, describe solution 
algorithms and their guarantees, and address commonly used optimizations. 


2.1 Markov Decision Processes 


Let $\mathcal{D}_X := \{\, d : X \to [0,1] \mid \sum_{x \in X} d(x) = 1 \,\}$ be the set of distributions over $X$.
A Markov decision process (MDP) [55] is a tuple $\mathcal{M} = (S, A, \delta)$ with finite sets of
states $S$ and actions $A$, and a partially defined transition function $\delta : S \times A \to \mathcal{D}_S$
such that $A(s) := \{\, a \mid (s,a) \in domain(\delta) \,\} \neq \emptyset$ for all $s \in S$. $A(s)$ is the set of
enabled actions at state $s$. $\delta$ maps enabled state-action pairs to distributions over
successor states. A Markov chain (MC) is an MDP with $|A(s)| = 1$ for all $s$. The
semantics of an MDP are defined in the usual way, see, e.g. [6, Chapter 10]. A
(memoryless deterministic) policy (a.k.a. strategy or scheduler) is a function
$\pi : S \to A$ that, intuitively, given the current state $s$ prescribes what action
$a \in A(s)$ to play. Applying a policy $\pi$ to an MDP induces an MC $\mathcal{M}^\pi$. A path
in this MC is an infinite sequence $\rho = s_1 s_2 \ldots$ with $\delta(s_i, \pi(s_i))(s_{i+1}) > 0$. $Paths$
denotes the set of all paths and $\mathbb{P}^\pi_s$ denotes the unique probability measure of
$\mathcal{M}^\pi$ over infinite paths starting in the state $s$.

A reachability objective $P_{opt}(T)$ with set of target states $T \subseteq S$ and $opt \in
\{max, min\}$ induces a random variable $X : Paths \to [0,1]$ over paths by assigning 1
to all paths that eventually reach the target and 0 to all others. $E_{opt}(rew)$ denotes
an expected reward objective, where $rew : S \to \mathbb{Q}_{\ge 0}$ assigns a reward to each state.
$rew(\rho) := \sum_{i=1}^{\infty} rew(s_i)$ is the accumulated reward of a path $\rho = s_1 s_2 \ldots$. This
yields a random variable $X : Paths \to \mathbb{Q} \cup \{\infty\}$ that maps paths to their reward.
For a given objective and its random variable $X$, the value of a state $s \in S$ is the
expectation of $X$ under the probability measure $\mathbb{P}^\pi_s$ of the MC induced by an
optimal policy $\pi$ from the set of all policies $\Pi$, formally $V(s) := opt_{\pi \in \Pi}\, \mathbb{E}^\pi_s[X]$.


2.2 Solution Algorithms 


Value iteration (VI), e.g. [15], computes a sequence of value vectors converging 
to the optimum in the limit. In all variants of the algorithm, we start with a 
function $x : S \to \mathbb{Q}$ that assigns to every state an estimate of the value. The
algorithm repeatedly performs an update operation to improve the estimates. 
After some preprocessing, this operation has a unique fixpoint when x = V. Thus, 
value iteration converges to the value in the limit. Variants of VI include interval 
iteration [34], sound VI [56] and optimistic VI [40]. We do not discuss these in 
detail, but instead refer to the respective papers. 
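To make the update operation concrete, the following sketch runs plain VI for reachability on a tiny hand-made MDP with exact rational arithmetic. It is an illustration only: the dictionary encoding, the function names and the example model are assumptions, and the fixed iteration count is not a sound stopping criterion.

```python
from fractions import Fraction as F
from typing import Callable, Dict, Set

# mdp[state][action] is a distribution over successor states
MDP = Dict[str, Dict[str, Dict[str, F]]]


def bellman_update(mdp: MDP, targets: Set[str], x: Dict[str, F],
                   opt: Callable = max) -> Dict[str, F]:
    """One update: targets keep value 1, other states optimize over actions."""
    return {s: F(1) if s in targets else
               opt(sum(p * x[succ] for succ, p in dist.items())
                   for dist in mdp[s].values())
            for s in mdp}


def value_iteration(mdp: MDP, targets: Set[str], opt: Callable = max,
                    iterations: int = 100) -> Dict[str, F]:
    """Plain VI from below with a fixed number of updates (no guarantee)."""
    x = {s: F(1) if s in targets else F(0) for s in mdp}
    for _ in range(iterations):
        x = bellman_update(mdp, targets, x, opt)
    return x


# Hypothetical acyclic MDP: s0 may gamble directly (a) or move to s1 first (b)
mdp = {
    "s0": {"a": {"goal": F(1, 2), "sink": F(1, 2)}, "b": {"s1": F(1)}},
    "s1": {"c": {"goal": F(9, 10), "sink": F(1, 10)}},
    "goal": {"stay": {"goal": F(1)}},
    "sink": {"stay": {"sink": F(1)}},
}
v_max = value_iteration(mdp, {"goal"}, max)
v_min = value_iteration(mdp, {"goal"}, min)
```

On this acyclic model the estimates are exact after two updates: the maximum value of `s0` is 9/10 (via `b`) and the minimum is 1/2 (via `a`). On models with cycles the estimates only approach the value in the limit.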

Linear programming (LP), e.g. [6, Chapter 10], encodes the transition structure
of the MDP and the objective as a linear optimization problem. For every state, 
the LP has a variable representing an estimate of its value. Every state-action 
pair is encoded as a constraint on these variables, as are the target set or rewards. 
The unique optimum of the LP is attained if and only if for every state its 
corresponding variable is set to the value of the state. We provide an in-depth 
discussion of theoretical and practical aspects of LP in Section 3. 
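For illustration, one standard textbook encoding for maximal reachability probabilities can be written as follows. This is a sketch in the notation of Section 2.1, not necessarily the exact formulation used by the tools; $S_0$ (the states from which $T$ cannot be reached under any policy, found by graph preprocessing) is a name introduced here.

```latex
\begin{align*}
\text{minimize}   \quad & \sum_{s \in S} x_s \\
\text{subject to} \quad & x_s = 1 && \text{for all } s \in T, \\
                        & x_s = 0 && \text{for all } s \in S_0, \\
                        & x_s \ge \sum_{s' \in S} \delta(s,a)(s') \cdot x_{s'}
                          && \text{for all } s \in S \setminus (T \cup S_0),\ a \in A(s).
\end{align*}
```

Each enabled state-action pair contributes one constraint, so the LP has $|S|$ variables and at most $\sum_{s} |A(s)|$ inequality constraints. For minimal probabilities the inequalities are reversed and the objective is maximized.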

Policy iteration (PI), e.g. [11, Section 4], computes a sequence of policies. 
Starting with an initial policy, we evaluate its induced MC, improve the policy by 
switching suboptimal choices and repeat the process on the new policy. As every 
policy improves the previous one and there are only finitely many memoryless 
deterministic policies (a number exponential in the number of states), eventually 
we obtain an optimal policy. We further discuss PI in Section 4. 


2.3 Guarantees 


Given the stakes in many application domains, we require guarantees about the
relation between an algorithm’s result $\bar{v}$ and the true value $v$. First, implementations
are subject to floating-point errors and imprecision [59] unless they use
exact (rational) arithmetic or safe rounding [36]. This can result in arbitrary
differences between $\bar{v}$ and $v$. Second are the algorithm’s inherent properties: VI
is an approximating algorithm that converges to the true value only in the limit.
In theory, it is possible to obtain the exact result by rounding after exponentially
many iterations [15]; in practice, this results in excessive runtime. Instead, for
years, implementations used a naive stopping criterion that could return arbitrarily
wrong results [33]. This problem’s discovery sparked the development
of sound variants of VI [8,34,40,19,54,56], including interval iteration, sound
value iteration, and optimistic value iteration. A sound VI algorithm guarantees
$\varepsilon$-precise results, i.e. $|v - \bar{v}| \le \varepsilon$ or $|v - \bar{v}| \le v \cdot \varepsilon$. For LP and PI, the guarantees
have not yet been thoroughly investigated. Theoretically, both are exact, but
implementations are often not. We discuss the problems in Sections 3 and 4.

Table 1: Correct results

alg.   solver     n ≤
PI     -          20
LP     COPT       18
       CPLEX      18
       Glop       25
       GLPK       24
       Gurobi     18
       HiGHS      22
       lp_solve   28
       Mosek      22
       SoPlex     34

Fig. 1: A hard MDP for all algorithms

The handcrafted MC of [33, Figure 2] highlights the lack of guarantees
of VI: standard implementations return vastly incorrect results. We extended
it with action choices to obtain the MDP M_n shown in Fig. 1 for n ∈ ℕ,
n ≥ 2. It has 2n + 1 states; we compute Pmin({n}) and Pmax({n}). The policy
that chooses action m wherever possible induces the MC of [33, Figure 2] with
(Pmin({n}), Pmax({n})) = (1/2, 1/2). In every state s with 0 < s < n, we added
the choice of action j that jumps to n and −n. With that, the (optimal) values
over all policies are (1/3, 2/3). In VI, starting from value 0 for all states except n,
initially taking j everywhere looks like the best policy for Pmax. As updated
values slowly propagate, state-by-state, m becomes the optimal choice in all states
except −n + 1. We thus layered a “deceptive” decision problem on top of the slow
convergence of the original MC. For n = 20, VI with Storm and mcsta deliver the
incorrect results (0.247, 0.500). For Storm’s PI and various LP solvers, we show in
Table 1 the largest n for which they return a ±0.01-correct result. For larger n,
PI and all LP solvers claim ≈ (1/2, 1/2) as the correct solution, except for Glop and
GLPK, which only fail for the maximum at the given n; for the minimum, they
return the wrong result at n ≥ 29 and 52, respectively. Sound VI algorithms and
Storm’s exact-arithmetic engine produce (ε-)correct results, though the former at
excessive runtime for larger n. We used default settings for all tools and solvers.


2.4 Optimizations


VI, LP, and PI can all benefit from the following optimizations: 


Graph-theoretic algorithms can be used for qualitative analysis of the MDP, 
i.e. finding states with value 0 or (only for reachability objectives) 1. These 
qualitative approaches are typically a lot faster than the numerical computations 
for quantitative analysis. Thus, we always apply them first and only run the 
numerical algorithms on the remaining states with non-trivial values. 
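
As an illustration of such a qualitative precomputation, the sketch below finds the states whose maximal reachability probability is 0 by a backward graph search that ignores the concrete probabilities. The encoding and the function name are hypothetical and chosen only for this example.

```python
def prob0_max(mdp, target):
    """States from which no path (under any policy) leads to the target."""
    can_reach = set(target)
    changed = True
    while changed:
        changed = False
        for s, actions in mdp.items():
            if s in can_reach:
                continue
            # s can reach the target if some action has a positive-probability
            # successor that can reach it.
            if any(p > 0 and t in can_reach
                   for dist in actions.values() for t, p in dist.items()):
                can_reach.add(s)
                changed = True
    return set(mdp) - can_reach

# Hypothetical example: s1 and s2 are traps, s0 can reach the goal via b.
mdp = {
    "s0": {"a": {"s1": 1.0}, "b": {"goal": 0.5, "s2": 0.5}},
    "s1": {"a": {"s1": 1.0}},
    "s2": {"a": {"s2": 1.0}},
}
zero_states = prob0_max(mdp, {"goal"})
```

Only the states outside `zero_states` need to be passed to the numerical algorithm.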


Topological methods, e.g. [17], do not consider the whole MDP at once. Instead, 
they first compute a topological ordering of the strongly connected components 
(SCCs)⁵ and then analyze each SCC individually. This can improve the runtime,
as we decompose the problem into smaller subproblems. The subproblems can 
be solved with any of the solution methods. Note that when considering acyclic 
MDPs, the topological approach does not need to call the solution methods, as 
the resulting values can immediately be backpropagated. 
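
For the acyclic special case mentioned above, the backpropagation can be sketched as a memoized recursion that visits every state once after all of its successors. The encoding and names are hypothetical; for large models an explicit topological order would replace the recursion to avoid deep call stacks.

```python
from functools import lru_cache

def acyclic_pmax(mdp, target):
    """Maximal reachability probabilities of an acyclic MDP by backpropagation."""
    @lru_cache(maxsize=None)
    def value(s):
        if s in target:
            return 1.0
        if s not in mdp:          # absorbing non-target state
            return 0.0
        return max(sum(p * value(t) for t, p in dist.items())
                   for dist in mdp[s].values())
    return {s: value(s) for s in mdp}

# Hypothetical acyclic example.
mdp = {
    "s0": {"a": {"s1": 0.5, "goal": 0.5}, "b": {"s2": 1.0}},
    "s1": {"a": {"goal": 0.2, "fail": 0.8}},
    "s2": {"a": {"goal": 0.7, "fail": 0.3}},
}
vals = acyclic_pmax(mdp, {"goal"})
```

Here action a in s0 yields 0.5 · 0.2 + 0.5 = 0.6 and action b yields 0.7, so no iterative solution method is needed to obtain the value 0.7 for s0.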


Collapsing of maximal end components (MECs), e.g., [13,34], transforms the MDP 
into one with equivalent values but simpler structure. After collapsing MECs, 
the MDP is contracting, i.e. we almost surely reach a target state or a state with 
value zero. VI algorithms rely on this property for convergence [34,40,56]. For PI 
and LP, simplifying the graph structure before applying the solution method can 
speed up the computation. 

Warm starts, e.g. [26,46], may adequately initialize an algorithm, i.e., we may 
provide it with some prior knowledge so that the computation has a good starting 
point. We implement warm starts by first running VI for a limited number of 
iterations and using the resulting estimate to guess bounds on the variables in 
an LP or a good initial policy for PI. See Sections 3 and 4 for more details. 


3 Practically solving MDPs using Linear Programs 


This section considers the LP-based approach to solving the optimal policy prob- 
lem in MDPs. To the best of our knowledge, this is the only polynomial-time 
approach. We discuss various configurations. These configurations are combinations
of the LP formulation, the choice of software, and their parameterization.


⁵ A set S′ ⊆ S is a connected component if for all s, s′ ∈ S′, s can be reached from s′.
We call S′ a strongly connected component if it is inclusion-maximal.


3.1 How to encode MDPs as LPs?


For objective Pmax(T) we formulate the following LP over variables x_s, s ∈ S \ T:

\[
\text{minimize } \sum_{s \in S \setminus T} x_s \quad \text{s.t.} \quad lb(s) \le x_s \le ub(s) \quad \text{and}
\]
\[
x_s \ge \sum_{s' \in S \setminus T} \delta(s,a)(s') \cdot x_{s'} + \sum_{t \in T} \delta(s,a)(t) \quad \text{for all } s \in S \setminus T,\ a \in A.
\]

We assume bounds lb(s) = 0 and ub(s) = 1 for s ∈ S \ T. The unique solution
η: {x_s | s ∈ S \ T} → [0,1] to this LP coincides with the desired objective
values η(x_s) = V(s). Objectives Pmin(T) and Eopt(rew) have similar encodings:
minimizing policies require maximisation in the LP and flipping the constraint
relation. Rewards can be added as an additive term on the right-hand side. For
practical purposes, the LP formulation can be tweaked.
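
As a small worked instance of this encoding, consider a made-up MDP (an illustration only, not a model from the benchmark sets) with S \ T = {s₀, s₁, z} and T = {g}: state s₀ has action a leading to s₁ with probability 1 and action b leading to g and z with probability 1/2 each, s₁ has a single action leading to g with probability 1/10 and to z with probability 9/10, and z is a sink with a self-loop. With the trivial bounds, the LP for Pmax(T) then reads:

```latex
\begin{align*}
\text{minimize}\quad & x_{s_0} + x_{s_1} + x_{z} \\
\text{s.t.}\quad & 0 \le x_{s_0},\, x_{s_1},\, x_{z} \le 1, \\
& x_{s_0} \ge x_{s_1}                                   && \text{(action $a$ in $s_0$)} \\
& x_{s_0} \ge \tfrac{1}{2}\, x_{z} + \tfrac{1}{2}       && \text{(action $b$ in $s_0$)} \\
& x_{s_1} \ge \tfrac{9}{10}\, x_{z} + \tfrac{1}{10}     && \text{(unique action in $s_1$)} \\
& x_{z} \ge x_{z}                                       && \text{(self-loop in $z$)}
\end{align*}
```

Minimization drives x_z to 0 and x_{s₁} to 1/10, after which the constraint of action b is the binding one for s₀, giving x_{s₀} = 1/2. The binding constraint thus identifies b as the optimal action, and the trivial constraint for z shows why precomputing probability-0 states removes useless variables.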


The choice of bounds. Any bounds that respect the unique solution will not change
the answer. That is, any lb and ub with 0 ≤ lb(s) ≤ V(s) ≤ ub(s) yield a sound
encoding. While these additional bounds are superfluous, they may significantly
prune the search space. We investigate trivial bounds, e.g., knowing that all
probabilities are in [0,1], bounds from a structural analysis as discussed by [8],
and bounds induced by a warm start of the solver. For the latter, if we have
obtained values V′ ≤ V, e.g., induced by a suboptimal policy, then V′(s) is a
lower bound on the value x_s, which is particularly relevant as the LP minimizes.


Equality for unique actions. Markov chains, i.e., MDPs where |A| = 1, can be
solved using linear equation systems. The LP encoding uses one-sided inequalities
and the objective function to incorporate nondeterministic choices. We investigate
adding constraints for all states with a unique action:

\[
x_s \le \sum_{s' \in S \setminus T} \delta(s,a)(s') \cdot x_{s'} + \sum_{t \in T} \delta(s,a)(t) \quad \text{for all } s \in S \setminus T \text{ with } A(s) = \{a\}.
\]

These additional constraints may trigger different optimizations in a solver, e.g.,
some solvers use Gaussian elimination for variable elimination.


A simpler objective. The standard objective assures the solution η is optimal for
every state, whereas most invocations require only optimality in some specific
states, typically the initial state s₀ or the entry states of a strongly connected
component. In that case, the objective may be simplified to optimize only the
value for those states. This potentially allows for multiple optimal solutions: in 
terms of the MDP, it is no longer necessary to optimize the value for states that 
are not reached under the optimal policy. 


Encoding the dual formulation. Encoding a dual formulation to the LP is interest- 
ing for mixed-integer extensions to the LP, relevant for computing, e.g., policies 
in POMDPs [47], or when computing minimal counterexamples [58]. For LPs, due 
to the strong duality, the internal representation in the solvers we investigated is 
(almost) equivalent and all solvers support both solving the primal and the dual 
representation. We therefore do not further consider constructing them. 


3.2 How to solve LPs with existing solvers? 


We rely on the performance of state-of-the-art LP solvers. Many solvers have
been developed and are still actively advanced, see [2] for a recent comparison
on general benchmarks. We list the LP solvers that we consider for this work
in Table 2. The columns summarize for each solver the type of license, whether
it uses exact or floating-point arithmetic, whether it supports multithreading,
and what type of algorithms it implements. We also list whether the solver is
available from the two model checkers used in this study⁶.


Table 2: Available LP solvers (“intr” = interior point)

solver          version    license   exact/fp  parallel  algorithms    mcsta  Storm
COPT [24]       5.0.5      academic  fp        yes       intr+simplex  yes    no
CPLEX [44]      22.10      academic  fp        yes       intr+simplex  yes    no
Gurobi [32]     9.5        academic  fp        yes       intr+simplex  yes    yes
GLPK [29]       4.65       GPL       fp        no        intr+simplex  no     yes
Glop [30]       9.4.1874   Apache    fp        no        simplex only  yes    no
HiGHS [35,43]   1.2.2      MIT       fp        yes       intr+simplex  yes    no
lp_solve [10]   5.5.2.11   LGPL      fp        no        simplex only  yes    no
Mosek [52]      10.0       academic  fp        yes       intr+simplex  yes    no
SoPlex [28]     6.0.1      academic  both      no        simplex only  no     yes
Z3 [53]         4.8.13     MIT       exact     no        simplex only  no     yes


Methods. We briefly explain the available methods and refer to [12] for a thorough 
treatment. Broadly speaking, the LP solvers use one of two families of
methods. Simplex-based methods rely on highly efficient pivot operations to 
consider vertices of the simplex of feasible solutions. Simplex can be executed 
either in the primal or dual fashion, which changes the direction of progress 
made by the algorithm. Our LP formulation has more constraints than variables, 
which generally means that the dual version is preferable. Interior methods, 
often the subclass of barrier methods, do not need to follow the set of vertices. 
These methods may achieve polynomial time worst-case behaviour. It is generally 
claimed that simplex has superior average-case performance but is highly sensitive 
to perturbations, while interior-point methods have a more robust performance. 


Warm starts. LP-based model checking can use two types of warm starts: either
providing a (feasible) basis point as done in [26], or presenting bounds. The
former, however, comes with various remarks and limitations, such as the
requirement to disable preprocessing. We therefore used warm starts only
by using bounds as discussed above.


Multithreading. We generally see two types of parallelisation in LP solvers. Some 
solvers support a portfolio approach that runs different approaches and finishes 
with the first one that yields a result. Other solvers parallelize the interior-point 
and/or simplex methods themselves. 


Guarantees for numerical LP solvers. All LP solvers allow tweaking of various 
parameters, including tolerances to manage whether a point is considered feasible 
or optimal, respectively. The experiments in Table 1 already indicate that these 
guarantees are not absolute. A limited experiment indicated that reducing these 
tolerances towards zero did remove some incorrect results, but not all. 


⁶ Support for Gurobi, GLPK, and Z3 was already available in Storm. Support for Glop
was already available in mcsta. All other solver interfaces have been added.


Exact solving. SoPlex supports exact computations, with a Boost library wrapping
GMP rationals [22], after a floating-point arithmetic-based startup phase [27]. 
While this combination is beneficial for performance in most settings, it leads to 
crashes for the numerically challenging models. Z3 supports only exact arithmetic 
(also wrapping GMP numbers with their own interface). We observe that the 
price of converting large rational numbers may be substantial. SMT solvers like 
Z3 use a simplex variation [18] tailored towards finding feasible points and in an 
incremental fashion, optimized for problems with a nontrivial Boolean structure. 
In contrast, our LP formulation is easily feasible and is a pure conjunction. 


4 Sound Policy Iteration 


Starting with an initial policy, PI-based algorithms iteratively improve the policy 
based on the values obtained for the induced MC. The algorithm for solving 
the induced MC crucially affects the performance and accuracy of the overall 
approach. This section addresses the solvers available in Storm, possible precision
issues, and how to utilize a warm start, while Section 5 discusses PI performance⁷.


Markov chain solvers. To solve the induced MC, Storm can employ all linear 
equation solvers listed in [42] and all implemented variants of VI. In our experi- 
ments, we consider (i) the generalized minimal residual method (GMRES) [57] 
implemented in GMM++ [25], (ii) VI [15] with a standard (relative) termination 
criterion, (iii) optimistic VI (OVI) [40], and (iv) the sparse LU decomposition 
implemented in Eigen [31] using either floating-point or exact arithmetic (LU*). 
LU and LU* provide exact results (modulo floating-point errors in LU) while 
OVI yields ε-precise results. VI and GMRES do not provide any guarantees.


Correctness of PI. The accuracy of PI is affected by the MC solver. Firstly, PI 
cannot be more precise than its underlying solver: the result of PI has the same 
precision as the result obtained for the final MC. Secondly, inaccuracies by the 
solver can hide policy improvements; this may lead to premature convergence with 
a sub-optimal policy. We show that PI can return arbitrarily wrong results, even
if the intermediate results are ε-precise:

Consider the MDP in Fig. 2 with objective Pmax({G}). There is only one
nondeterministic choice, namely in state s₀. The optimal policy is to pick b,
obtaining a value of 0.5. Picking a only yields 0.1. However, when starting from
the initial policy π(s₀) = a, an ε-precise MC solver may return 0.1 + ε for both s₀
and s₁ and δ/2 + (1 − δ) · 0.1 for s₂. This solution is indeed ε-precise. However,
when evaluating which action to pick in s₀, we can choose δ such that a seems to
obtain a higher value. Concretely, we require δ/2 + (1 − δ) · 0.1 < 0.1 + ε, which
simplifies to 0.4 · δ < ε. For every ε > 0, this can be achieved by setting
δ < 2.5 · ε. In this case, PI would terminate with the final policy inducing a
severely suboptimal value.
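
The arithmetic of this counterexample can be checked with a few lines of Python. The function below only encodes the two reported values from the text; its name and the chosen values of ε and δ are illustrative assumptions.

```python
def a_looks_better(eps, delta):
    """True if the reported value for action a exceeds the one seen for b."""
    reported_a = 0.1 + eps                      # eps-precise over-estimate
    reported_b = delta / 2 + (1 - delta) * 0.1  # value obtained for the b-successor
    return reported_b < reported_a

# delta = 2 * eps satisfies delta < 2.5 * eps, delta = 3 * eps does not.
checks = [(eps, a_looks_better(eps, 2.0 * eps), a_looks_better(eps, 3.0 * eps))
          for eps in (1e-2, 1e-4, 1e-6)]
```

For every tested ε, the first flag is true and the second false: making the solver more precise does not help as long as δ can be chosen below 2.5 · ε, so PI keeps action a with value 0.1 although b achieves 0.5.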


Fig. 2: Example MDP 


⁷ [46] addresses performance in the context of PI for stochastic games.


If every Markov chain is solved precisely, PI is correct. Indeed, it suffices to be
certain that one action is better than all others. This is the essence of modified
policy iteration as described in [55, Chapters 6.5 and 7.2.6]. Similarly, [46, Section
4.2] suggests using interval iteration when solving the system induced by the
current policy and stopping when the under-approximation of one action is higher
than the over-approximation of all other actions.
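
The stopping rule in the last sentence can be sketched as a simple comparison of intervals. The function name and the input format (one lower and upper bound per action) are hypothetical; how the bounds are obtained, e.g. by interval iteration, is outside this sketch.

```python
def certainly_best(bounds):
    """bounds maps each action to (lower, upper) bounds on its value.

    Returns the action whose lower bound is higher than the upper bound of
    every other action, or None if the bounds do not yet allow a decision.
    """
    for a, (lo_a, _) in bounds.items():
        if all(lo_a > hi_b for b, (_, hi_b) in bounds.items() if b != a):
            return a
    return None

decided = certainly_best({"a": (0.10, 0.11), "b": (0.40, 0.60)})
undecided = certainly_best({"a": (0.10, 0.45), "b": (0.40, 0.60)})
```

In the first call the intervals are disjoint and b is selected; in the second they overlap, so the bounds would have to be refined further before switching or keeping an action. With a strict comparison, two actions of exactly equal value never separate, which an implementation would need to handle, for instance with a precision cutoff.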


Warm starts. PI profits from being provided a good initial policy. If the initial 
policy is already optimal, PI terminates after a single iteration. We can inform 
our choice of the initial policy by providing estimates for all states as computed 
by VI. For every state, we choose the action that is optimal according to the 
estimate. This is a good way to leverage VI’s ability to quickly deliver good 
estimates [40], while at the same time providing the exactness guarantees of PI. 
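
Extracting the initial policy from a value estimate amounts to one greedy step per state. The sketch below assumes a hypothetical dict-based encoding and an estimate that also contains the values 1 and 0 for target and sink states; it is an illustration, not Storm's implementation.

```python
def greedy_policy(mdp, estimate, maximize=True):
    """Pick, in every state, the action that looks best under `estimate`."""
    pick = max if maximize else min
    policy = {}
    for s, actions in mdp.items():
        def one_step(a):
            return sum(p * estimate[t] for t, p in actions[a].items())
        policy[s] = pick(actions, key=one_step)
    return policy

# Hypothetical example with an imprecise estimate for s0 (true value 0.5).
mdp = {
    "s0": {"a": {"s1": 1.0}, "b": {"goal": 0.5, "fail": 0.5}},
    "s1": {"c": {"goal": 0.1, "fail": 0.9}},
}
estimate = {"s0": 0.45, "s1": 0.1, "goal": 1.0, "fail": 0.0}
initial_policy = greedy_policy(mdp, estimate)
```

Even though the estimate for s0 is off by 0.05, the greedy step already selects the optimal action b, so a subsequent PI run would only need to confirm it.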


5 Experimental Evaluation 


To understand the practical performance of the different algorithms, we performed 
an extensive experimental evaluation. We used three sets of benchmarks: all 
applicable benchmark instances from the Quantitative Verification Benchmark 
Set (QVBS) [41] (the qvbs set), a subset of hard QVBS instances (the hard set),
and numerically challenging models from a runtime monitoring application [45] 
(the premise set, named for the corresponding prototype). We consider two prob- 
abilistic model checkers, Storm [42] and the Modest Toolset’s [37] mcsta. We used 
Intel Xeon Platinum 8160 systems running 64-bit CentOS Linux 7.9, allocating 4 
CPU cores and 32GB RAM to each experiment unless noted otherwise. 

We plot algorithm runtimes in seconds in quantile plots as on the left and 
scatter plots as on the right of Fig. 3. The former compare multiple tools or con- 
figurations; for each, we sort the instances by runtime and plot the corresponding 
monotonically increasing line. Here, a point (x,y) on the a-line means that the
x-th fastest instance solved by a took y seconds. The latter compare two tools
or configurations. Each point (x,y) is for one benchmark instance: the x-axis 
tool took x while the y-axis tool took y seconds to solve it. The shape of points 
indicates the model type; the mapping from shapes to types is the same for all 
scatter plots and is only given explicitly in the first one in Fig. 3. Additional 
plots to support the claims in this section are provided in the appendix of the 
full version [39] of this paper. 
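
The construction of one line of a quantile plot can be sketched as follows; the function name and the convention of representing “n/a” results by None are assumptions made for this illustration.

```python
def quantile_points(runtimes):
    """Turn per-instance runtimes (seconds, None for n/a) into plot points."""
    solved = sorted(t for t in runtimes if t is not None)
    # Point (x, y): the x-th fastest solved instance took y seconds.
    return [(x, t) for x, t in enumerate(solved, start=1)]

points = quantile_points([3.0, None, 0.5, 12.0])
```

The resulting line is monotonically increasing by construction, and its length along the x-axis shows how many instances the tool solved at all.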

Fig. 3: Comparison of LP solver runtime on the qvbs set

⁸ A benchmark instance is a combination of model, parameter valuation, and objective.

The depicted runtimes are for the respective algorithm and all necessary
and/or stated preprocessing, but do not include the time for constructing the
MDP state spaces (which is independent of the algorithms). mcsta reports all
time measurements rounded to multiples of 0.1 s. We summarize timeouts,
out-of-memory, errors, and incorrect results as “n/a”. Our timeout is 30 minutes for
the algorithm and 45 minutes for total runtime including MDP construction. We
consider a result incorrect if |v − v̂| > v · 10⁻³ (i.e. relative error 10⁻³) whenever
a reference result v is available. However, we do not flag a result as incorrect if
v and v̂ are both below 10⁻⁸ (relevant for the premise set). Nevertheless, we
configure the (unsound) convergence threshold for VI as 10⁻⁶ relative; among the
sound VI algorithms, we include OVI, with a (sound) stopping criterion of relative
10⁻⁶ error. To only achieve the 10⁻³ precision we actually test, OVI could thus
be even faster than it appears in our plots. We make this difference to account
for the fact that many algorithms, including the LP solvers, do not have a sound
error criterion. We mark exact algorithms/solvers that use rational arithmetic
with a superscript *. The other configurations use floating-point arithmetic (fp).
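
The correctness check described above can be written down directly. The function name and parameter names are hypothetical, and the thresholds are the ones stated in the text as we read them (relative error 10⁻³, lower cutoff 10⁻⁸).

```python
def is_incorrect(result, reference, rel_tol=1e-3, floor=1e-8):
    """Flag a result as incorrect relative to a reference value, if one exists."""
    if reference is None:                  # no reference result available
        return False
    if result < floor and reference < floor:
        return False                       # both values are negligibly small
    return abs(reference - result) > reference * rel_tol

flags = [
    is_incorrect(0.247, 1 / 3),    # far off the reference
    is_incorrect(0.3334, 1 / 3),   # within relative error 1e-3
    is_incorrect(0.0, 5e-9),       # both below the cutoff
    is_incorrect(0.5, None),       # nothing to compare against
]
```

Only the first of the four calls flags the result as incorrect.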


5.1 The QVBS Benchmarks 


The qvbs set comprises all QVBS benchmark instances with an MDP, Markov
automaton (MA), or probabilistic timed automaton (PTA) model⁹ and a reachability
or expected reward/time objective that is quantitative, i.e. not a query that
yields a zero or one probability. We only consider instances where both Storm 
and mcsta can build the explicit representation of the MDP within 15 minutes. 
This yields 367 instances. We obtain reference results for 344 of them from either 
the QVBS database or by using one of Storm’s exact methods. We found all 
reference results obtained via different methods to be consistent. 

For LP, we have various solvers with various parameters each, cf. Section 3. For
conciseness, we first compare all available LP solvers on the qvbs set. For the
best-performing solver, we then evaluate the benefit of different solver configurations.
We do the same for the choice of Markov chain solution method in PI. We then
focus on these single, reasonable setups for LP and PI each in more detail.


LP solver comparison. The left-hand plot of Fig. 3 summarizes the results of
our comparison of the different LP solvers. Subscripts s and m indicate whether
the solver is embedded in either Storm or mcsta. We apply no optimizations or
reductions to the MDPs except for the precomputation of probability-0 states
(and in Storm also of probability-1 states), and use the default settings for all
solvers, with the trivial variable bounds [0, 1] and [0, ∞) for probabilities and
expected rewards, respectively. We include VI as baseline. In Table 3, we
summarize the results.

In terms of performance and scalability, Gurobi solves the highest number of
benchmarks in any given time budget, closely followed by COPT. CPLEX, HiGHS,
and Mosek make up a middle-class group. While the exact solver Z3 is very slow,
SoPlex’s exact mode actually competes with some fp solvers. However, the quantile
plots do not tell the whole story. On the right of Fig. 3, we compare COPT and
Gurobi directly: each has a large number of instances on which it is (much) better.


Table 3: LP summary

solver       correct  incorr.  no result
VI_s         359      3        0
VI_m         357      8        2
COPT_m       312      12       43
CPLEX_m      291      10       66
Glop_m       257      4        106
GLPK_s       199      5        163
Gurobi_s     331      4        39
Gurobi_m     323      4        40
HiGHS_m      288      10       69
lp_solve_m   209      0        158
Mosek_m      287      15       65
SoPlex_s     …        …        …
SoPlex*_s    …        …        …
Z3*_s        148      0        219


⁹ MA and PTA are converted to MDP via embedding and digital clocks [48].

In terms of reliability of results, the exact solvers as expected produce no
incorrect results; neither does the slowest fp solver, lp_solve. COPT, CPLEX, HiGHS,
Mosek, and fp-SoPlex perform badly in this metric, producing more errors than
VI. Interestingly, these are mostly the faster solvers, the exception being Gurobi.

Overall, Gurobi achieves highest performance at decent reliability; in the
remainder of this section, we thus use Gurobi whenever we apply non-exact LP.


LP solver tweaking. Gurobi can be configured to use an “auto” portfolio approach, 
potentially running multiple algorithms concurrently on multiple threads, a primal 
or a dual simplex algorithm, or a barrier method algorithm. We compared each 
option with 4 threads and found no significant performance difference. Similarly, 
running the auto method with 1, 4, and 16 threads (only here, we allocate 16 
threads per experiment) also failed to bring out noticeable performance differences. 
Using more threads results in a few more out-of-memory errors, though. We thus 
fix Gurobi on auto with 4 threads. 

Fig. 4 shows the performance impact of supplying Gurobi with more precise
bounds on the variables for expected reward objectives using methods from
[8,51] (“bounds” instead of “simple”), of optimizing only for the initial state (“init”)
instead of the sum over all states (“all”), and of using equality (“eq”) instead of
less-/greater-than-or-equal (“ineq”) for unique action states. More precise bounds
yield a very small improvement at essentially no cost. Optimizing for the initial
state only results in a little better overall performance (in the “pocket” in the
quantile plot around x = 315 that is also clearly visible in the scatter plot).
However, it also results in 2 more incorrect results in the qvbs set. Using equality
for unique actions noticeably decreases performance and increases the incorrect
result count by 9 instances. For all experiments that follow, we thus use the more
precise bounds, but do not enable the other two optimizations.

Fig. 5: Comparison of MDP model checking algorithms on the qvbs set


PI methods comparison. The main choice in PI is which algorithm to use to solve
the induced Markov chains. On the right, we show the performance of the different
algorithms available in Storm (cf. Section 4). LU* yields a fully exact PI.
Interestingly, this performs better than the fp version, potentially because fp
errors induce spurious policy changes. The same effect likely also hinders the use
of OVI, whereas VI leads to good performance. Nevertheless, gmres is best overall,
and thus our choice for all following experiments with non-exact PI. VI and gmres
yield 6 and 4 incorrect results, respectively. OVI and the exact methods are always
correct on this benchmark set.

[Inline quantile plot with lines PI/gmres, PI/VI, PI/OVI, PI/LU, and PI/LU*]


Best MDP algorithms for QVBS. We now compare all MDP model checking 
algorithms on the qvbs set: with floating-point numbers, LP and PI configured as
described above, plus unsound VI, sound OVI, and the warm-start variants of PI 
and LP denoted “VI2PI” and “VI2LP”, respectively. Exact results are provided 
by rational search (RS, essentially an exact version of VI) [50], PI with exact LU, 
and LP with exact solvers (SoPlex and Z3). All are implemented in Storm. 

In a first experiment, we evaluated the impact of using the topological
approach and of collapsing MECs (cf. Section 2.4). The results, for which we
omit plots, are that the topological approach noticeably improves performance
and scalability for all algorithms, and we therefore always use it from now on.
Collapsing MECs is necessary to guarantee termination of OVI, while for the
other algorithms it is a potential optimization; however, we found it to overall
have a minimal positive performance impact only. Since it is required by OVI
and does not reduce performance, we also always use it from now on.

Fig. 7: Comparison of MDP model checking algorithms on the hard subset

Fig. 5 shows the complete comparison of all the methods on the qvbs set,
for fp algorithms on the left and exact solutions on the right. Among the fp
algorithms, OVI is clearly the fastest and most scalable. VI is somewhat faster
but incurs several incorrect results that diminish its appearance in the quantile
plot. OVI is additionally special among these algorithms in that it is sound, i.e.
provides guaranteed ε-correct results (though only up to fp rounding errors, which
can be eliminated following the approach of [36]). On the exact side, PI with
an inexact-VI warm start works best. The scatter plot in Fig. 6(a) shows the
performance impact of computing an exact instead of an approximate solution.


5.2 The Hard QVBS Benchmarks 


The QVBS contains many models built for tools that use VI as default algorithm.
The other algorithms may actually be important to solve key challenging instances
where VI/OVI perform badly. This contribution could be hidden in the sea of
instances trivial for VI. We thus zoom in on a selection of QVBS instances that
appear “hard” for VI: those where VI takes longer than the prior MDP state
space construction phase in both Storm and mcsta, and additionally both phases
together take at least 1 s. These are 18 of the previously considered 367 instances.

Fig. 8: Comparison of MDP model checking algorithms on the premise set

In Fig. 7, we show the behaviour of all the algorithms on this hard subset. OVI
again works better than VI due to the incorrect results that VI returns. We see
that the performance and scalability gap between the algorithms has narrowed;
although OVI still “wins”, LP in particular is much closer than on the full qvbs set.
We also investigated the LP outcomes with solvers other than Gurobi: even on this
set, Gurobi and COPT remain the fastest and most scalable solvers. With mcsta,
in the basic configuration, they solve 16 and 17 instances, the slowest taking
835 s and 1334 s, respectively; with the topological optimization, the numbers
become 17 and 15 instances with the slowest at 1373 s and 1590 s. We
show the detailed comparison of OVI and LP in Fig. 6(c), noting that there are
a few instances where LP is much faster, and repeat the comparison between the
best fp and exact algorithms (Fig. 6(b)).


5.3 The Runtime Monitoring Benchmarks 


While the QVBS is intentionally diverse, our third set of benchmarks is inten- 
tionally focused: We study 200 MDPs from a runtime monitoring study [45]. The 
original problem is to compute the normalized risk of continuing to operate the sys- 
tem being monitored subject to stochastic noise, unobservable and uncontrollable 
nondeterminism, and partial state observations. This is a query for a conditional 
probability. It is answered via probabilistic model checking by unrolling an MDP 
model along an observed history trace of length n ∈ {50,...,1000} following
the approach of Baier et al. [7]. The MDPs contain many transitions back to the 
initial state, ultimately resulting in numerically challenging instances (containing 
structures similar to the one of M_n in Section 2.3). We were able to compute a
reference result for all instances. 

Fig. 8 compares the different MDP model checking algorithms on this set. In 
line with the observations in [45], we see very different behaviour compared to 
the QVBS. Among the fp solutions on the left, LP with Gurobi terminates very 
quickly (under 1s), and either produces a correct (155 instances) or a completely 
incorrect result (mostly 0, on 45 instances). VI behaves similarly, but is slower. 
OVI, in contrast, delivers no incorrect result, but instead fails to terminate on all 
but 116 instances. In the exact setting, warm starts using VI inherit its relative
slowness and consequently do not pay off. Exact PI outperforms both exact LP
solvers. In the case of exact SoPlex, out of the 112 instances it does not manage 
to solve, 98 are crashes likely related to a confirmed bug in its current version. 

The premise set highlights that the best MDP model checking algorithm 
depends on the application. Here, in the fp case, LP appears best but produces 
unreliable (incorrect) results; the seemingly much worse OVI at least does not 
do so. Given the numeric challenge, an exact method should be chosen, and we 
show that these actually perform well here. 


6 Conclusion 


We thoroughly investigated the state of the art in MDP model checking, showing 
that there is no single best algorithm for this task. For benchmarks which are 
not numerically challenging, OVI is a sensible default, closely followed by PI and 
LP with a warm start—although using the latter two means losing soundness as 
confirmed by a number of incorrect results in our experiments. For numerically 
hard benchmarks, PI and LP as well as computing exact solutions are more 
attractive, and clearly preferable in combination. Overall, although LP has the 
superior (polynomial) theoretical complexity, in our practical evaluation, it almost 
always performs worse than the other (exponential) approaches. This is even 
though we use modern commercial solvers and tune both the LP encoding of the 
problem as well as the solvers’ parameters. While we observed the behaviour of 
the different algorithms and have some intuition into what makes the premise 
set hard, an entire research question of its own is to identify and quantify the 
structural properties that make a model hard. 

Our evaluation also raises the question of how prevalent MDPs that challenge 
VI are in practice. Aside from the premise benchmarks, we were unable to find 
further sets of MDPs that are hard for VI. Notably, several stochastic games (SGs) 
difficult for VI were found in [46]; the authors noted that using PI for the SGs 
was better than applying VI to the SGs. However, when we extracted the induced 
MDPs, we found them all easy for VI. Similarly, [3] used a random generation 
of SGs of at most 10,000 states, many of which were challenging for the SG 
algorithms. Yet the same random generation modified to produce MDPs delivered 
only MDPs easily solved in seconds, even with drastically increased numbers 
of states. In contrast, Alagöz et al. [1] report that their random generation 
returned models where LP beat PI. However, their setting is discounted, and 
their description of the random generation was too superficial for us to be able 
to replicate it. We note that, in several of our scatter plots, the MA instances 
from the QVBS (where we check the embedded MDP) appeared more challenging 
overall than the MDPs. We thus conclude this paper with a call for challenging 
MDP benchmarks—as separate benchmark sets of unique characteristics like 
premise, or for inclusion in the QVBS. 


Data availability statement. The datasets generated and analysed in this 
study and code to regenerate them are available in the accompanying artifact [38]. 
For Storm, our code builds on version 1.7.0. We used mcsta version 3.1.213.


Abstract. A classical problem for Markov chains is determining their
stationary (or steady-state) distribution. This problem has an equally 
classical solution based on eigenvectors and linear equation systems. 
However, this approach does not scale to large instances, and iterative 
solutions are desirable. It turns out that a naive approach, as used by 
current model checkers, may yield completely wrong results. We present 
a new approach, which utilizes recent advances in partial exploration and 
mean payoff computation to obtain a correct, converging approximation. 


1 Introduction 


Discrete-time Markov chains (MCs) are an elegant and standard framework to 
describe stochastic processes, with a vast area of applications such as computer 
science [4], biology [28], epidemiology [13], and chemistry [12], to name a few. 
In a nutshell, an MC comprises a set of states and a transition function, assigning 
to each state a distribution over successors. The system evolves by repeatedly 
drawing a successor state from the transition distribution of the current state. 
This can, for example, model communication over a lossy channel, a queuing 
network, or populations of predator and prey which grow and interact randomly. 
For many applications, the stationary distribution of such a system is of particular 
interest. Intuitively, this distribution describes which states the system is in 
after an “infinite” number of steps. For example, in a chemical reaction network 
this distribution could describe the equilibrium states of the mixture. 

Traditionally, the stationary distribution is obtained by computing the domi- 
nant eigenvector for particular matrices and solving a series of linear equation 
systems. This approach is appealing in theory, since it is polynomial in the size 
of the considered Markov chain. Moreover, since linear algebra is an intensely 
studied field, many optimizations for the computations at hand are known. 

In practice, however, these approaches often turn out to be insufficient. Real-world 
models may have millions of states, often ruling out exact solution approaches. 
As such, the attention turns to iterative methods. In particular, the 
popular model checker PRISM [21] employs the power method (or power iteration) 
to approximate the stationary distribution. Similar to many other problems on 
Markov chains, such iterative methods have an exponential worst case, but 
obtain good results quickly on many models. (Models where iterative methods 
indeed converge slowly are called stiff.) However, as we show in this work, the 
“absolute change” criterion used by PRISM to stop the iteration is incorrect. In 
particular, the produced results may be arbitrarily wrong already on a model 
with only four states. In [14,7] the authors discuss a similar issue for the problem 
of reachability, also rooted in an incorrect absolute change stopping criterion, and 
provide a solution through converging lower and upper bounds. In our case, the 
situation is more complicated. The convergence of the power method is quite 
difficult to bound: A good (and potentially tight) a-priori bound is given by 
the ratio of the first and second eigenvalues, which, however, is as hard to determine 
as solving the problem itself. In the case of MCs, only a crude bound on this 
ratio can be obtained easily, which gives an exponential bound on the number of 
iterations required to achieve a given precision. More strikingly, in contrast to 
reachability, there is to our knowledge no general adaptive stopping criterion for 
power iteration, i.e. a way to check whether the current iterates are already close 
to the correct result. Thus, one would always need to iterate for as many steps 
as given by the a-priori bound to obtain guarantees on the result. In summary, 
exact solution approaches do not scale well, and the existing iterative approach 
may yield wrong results or require an intractable number of steps. 

Another, orthogonal issue of the mentioned approaches is that they construct 
the complete system, i.e. determine the stationary distribution for each state. 
However, when we figure out that, for example, the stationary distribution has 
a value of at least 99% for one state, all other states can have at most 1% in 
total. In case we are satisfied with an approximate solution, we could already 
stop the computation here, without investigating any other state. Inspired by the 
results of [7,18], we thus also want to find such an approximate solution, capable 
of identifying the relevant parts of the system and only constructing those. 


1.1 Contributions 
In this work, we address all the above issues. To this end, we 


— provide a characterization of the stationary distribution through mean payoff 
which allows us to obtain provably correct approximations (Section 3), 

— introduce a general framework to approximate the stationary distribution in 
Markov chains, capable of utilizing partial exploration approaches (Section 4), 

— as the main technical contribution, provide very general, precise correctness 
and termination proofs, requiring only minimal assumptions (Theorem 3), 

— instantiate this framework with both the classical solution approach as well 
as our novel sampling-based interval approximation approach (Section 4.2), 

— evaluate the variants of our framework experimentally (Section 5), and 

— demonstrate with a minimal example that the standard approach of PRISM 
may yield arbitrarily wrong results (Fig. 2). 


1.2 Related Work 


Most related is the work of [30], which also tries to identify the most relevant 
parts of the system. However, it exploits the special structure given by cellular 
processes to find these regions and estimate the subsequent approximation 
error. Many other works deal with special cases, such as queueing models [1,17], 
time-reversible chains [8], or positive rows (all states have a transition to one 
particular state) [9,11,27]. In contrast, our methods aim to deal with general 
Markov chains. We highlight that for the “positive row” case, [11] also provides 
converging bounds, however through a different route. Another topic of interest 
is continuous-time Markov chains, where abstraction- and truncation-based 
algorithms are applicable [20,3] and computation of the stationary distribution 
can be used for time-bounded reachability [16]. 


2 Preliminaries 


As usual, $\mathbb{N}$ and $\mathbb{R}$ refer to the (positive) natural numbers and real numbers, 
respectively. For a set $S$, $\overline{S}$ denotes its complement, while $S^*$ and $S^\omega$ refer to the 
set of finite and infinite sequences comprising elements of $S$, respectively. We 
write $\mathbb{1}_S(s) = 1$ if $s \in S$ and $0$ otherwise for the characteristic function of $S$. 

We assume familiarity with basic notions of probability theory, e.g., probability 
spaces, probability measures, and measurability; see e.g. [6] for a general introduction. 
A probability distribution over a countable set $X$ is a mapping $d : X \to [0,1]$ 
such that $\sum_{x \in X} d(x) = 1$. Its support is denoted by $\mathrm{supp}(d) = \{x \in X \mid d(x) > 0\}$. 
$\mathcal{D}(X)$ denotes the set of all probability distributions on $X$. Some event happens 
almost surely (a.s.) if it happens with probability 1. 

The central objects of interest are Markov chains, a classical model for systems 
with stochastic behaviour: A (discrete-time time-homogeneous) Markov chain 
(MC) is a tuple $M = (S, \delta)$, where $S$ is a finite set of states, and $\delta : S \to \mathcal{D}(S)$ is 
a transition function that for each state $s$ yields a probability distribution over 
successor states. We deliberately exclude the explicit definition of an initial state. 
We direct the interested reader to, e.g., [4, Sec. 10.1], [29, App. A], or [19] for 
further information on Markov chains and related notions. 

For ease of notation, we write $\delta(s, s')$ instead of $\delta(s)(s')$, and, given a function 
$f : S \to \mathbb{R}$ mapping states to real numbers, we write $\delta(s)\langle f \rangle = \sum_{s' \in S} \delta(s, s') \cdot f(s')$ 
to denote the weighted sum of $f$ over the successors of $s$. 

We always assume an arbitrary but fixed numbering of the states and identify 
a state with its respective number. For example, given a vector $v \in \mathbb{R}^{|S|}$ and a 
state $s \in S$, we may write $v[s]$ to denote the value associated with $s$ by $v$. In this 
way, a function $v : S \to \mathbb{R}$ is equivalent to a vector $v \in \mathbb{R}^{|S|}$. 

For a set of states $R \subseteq S$ where no transitions leave $R$, i.e. $\delta(s, s') = 0$ for all 
$s \in R$, $s' \in S \setminus R$, we define the restricted Markov chain $M|_R = (R, \delta|_R)$ with 
$\delta|_R : R \to \mathcal{D}(R)$ copying the values of $\delta$, i.e. $\delta|_R(s, s') = \delta(s, s')$ for all $s, s' \in R$. 


Paths An infinite path $\rho$ in a Markov chain is an infinite sequence $\rho = s_1 s_2 \cdots \in S^\omega$ 
such that for every $i \in \mathbb{N}$ we have that $\delta(s_i, s_{i+1}) > 0$. We use $\rho(i)$ to refer to 
the $i$-th state $s_i$ in a given infinite path. We denote the set of all infinite paths of 
a Markov chain $M$ by $\mathsf{Paths}_M$. Observe that in general $\mathsf{Paths}_M$ is a proper subset 
of $S^\omega$, as we imposed additional constraints. A Markov chain together with an 
initial state $\hat{s} \in S$ induces a unique probability measure $\mathsf{Pr}_{M,\hat{s}}$ over infinite paths 
[4, Sec. 10.1]. Given a measurable random variable $f : \mathsf{Paths}_M \to \mathbb{R}$, we write 
$\mathbb{E}_{M,\hat{s}}[f] = \int_{\rho \in \mathsf{Paths}_M} f(\rho) \, d\mathsf{Pr}_{M,\hat{s}}$ to denote its expectation w.r.t. this measure. 
Reachability An important tool in the following is the notion of reachability 
probability, i.e. the probability that the system, starting from a state $\hat{s}$, will 
eventually reach a given set $T$. Formally, for a Markov chain $M$ and set of states 
$T$, we define the set of runs which reach $T$ (i) at step $n$ by 
$\Diamond^{=n} T := \{\rho \in \mathsf{Paths}_M \mid \rho(n) \in T\}$ and (ii) eventually by $\Diamond T = \bigcup_{i=1}^{\infty} \Diamond^{=i} T$. (For a measurability proof see 
e.g. [4, Chp. 10].) For a state $\hat{s}$, the probability to reach $T$ is given by $\mathsf{Pr}_{M,\hat{s}}[\Diamond T]$. 
Classically, the reachability probability can be determined by solving a linear 
equation system, as follows. For a fixed target set $T$, let $S_0$ be all states that 
cannot reach $T$. Note that $S_0$ can be determined by simple graph analysis. Then, 
the reachability probability $\mathsf{Pr}_{M,s}[\Diamond T]$ is the unique solution of [4, Thm. 10.19] 


$f(s) = 1$ if $s \in T$, $0$ if $s \in S_0$, and $\delta(s)\langle f \rangle$ otherwise. (1) 


Value Iteration A classical tool to deal with Markov chains is value iteration (VI) 
[5]. It is a simple yet surprisingly efficient and extendable approach to solve a 
variety of problems. At its heart, VI relies, as the name suggests, on iteratively 
applying an operation to a value vector. This operation is often called “Bellman 
backup” or “Bellman update”, usually derived from a fixed-point characterization 
of the problem at hand. Thus, VI can often be viewed as fixed-point iteration. 
For reachability, inspired by Eq. (1), we start from $v_1[s] = 0$ and iterate 


$v_{k+1}[s] = 1$ if $s \in T$, $0$ if $s \in S_0$, and $\delta(s)\langle v_k \rangle$ otherwise. (2) 


This iteration monotonically converges to the true value in the limit from below 
[4, Thm. 10.15], [29, Thm. 7.2.12]. Convergence up to a given precision may 
take exponential time [14, Thm. 3], but in practice VI often is much faster than 
methods based on equation solving. For further details, see [26, App. A.2]. 
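
To make the iteration of Eq. (2) concrete, the following Python sketch runs value iteration for reachability on a small hand-made chain. The dictionary encoding of the transition function, the state names, and the helper names `cannot_reach` and `reach_value_iteration` are assumptions made purely for illustration; they are not taken from any of the tools discussed here.

```python
def cannot_reach(delta, targets):
    """Return S_0, the states with no path into `targets`, by backward search."""
    can = set(targets)
    changed = True
    while changed:
        changed = False
        for s, succ in delta.items():
            if s not in can and any(p > 0 and t in can for t, p in succ.items()):
                can.add(s)
                changed = True
    return set(delta) - can


def reach_value_iteration(delta, targets, steps):
    """Apply the update of Eq. (2) `steps` times, starting from the zero vector."""
    s0 = cannot_reach(delta, targets)
    v = {s: 0.0 for s in delta}
    history = []
    for _ in range(steps):
        v = {
            s: 1.0 if s in targets
            else 0.0 if s in s0
            else sum(p * v[t] for t, p in delta[s].items())
            for s in delta
        }
        history.append(v)
    return v, history


# Hypothetical 3-state chain: from "s" we loop, hit the target "t", or get stuck in "u".
delta = {
    "s": {"s": 0.5, "t": 0.25, "u": 0.25},
    "t": {"t": 1.0},
    "u": {"u": 1.0},
}
values, history = reach_value_iteration(delta, {"t"}, steps=60)
print(values["s"])  # approaches the exact value 0.5 from below
```

In this example the exact reachability probability of "s" solves x = 0.5x + 0.25, i.e. x = 0.5, and the iterates never exceed it. Note that the sketch runs for a fixed number of steps; it does not provide a sound stopping criterion.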


Strongly Connected Components A non-empty set of states $C \subseteq S$ in a Markov 
chain is strongly connected if for every pair $s, s' \in C$ there is a non-empty finite 
path from $s$ to $s'$. Such a set $C$ is a strongly connected component (SCC) if it 
is inclusion maximal, i.e. there exists no strongly connected $C'$ with $C \subsetneq C'$. 
SCCs are disjoint, i.e. each state belongs to at most one SCC. An SCC is bottom 
(BSCC) if additionally no path leads out of it, i.e. for all $s \in C$, $s' \in S \setminus C$ we 
have $\delta(s, s') = 0$. The set of BSCCs in an MC $M$ is denoted by $\mathrm{BSCC}(M)$ and 
can be determined in linear time by, e.g., Tarjan’s algorithm [32]. 
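
As an illustration of how BSCCs can be found, the sketch below uses a compact recursive variant of Tarjan's algorithm and then keeps only the components without outgoing transitions. The function name `bottom_sccs`, the dictionary encoding, and the four-state example chain (including its transition probabilities) are assumptions for this sketch. The recursive formulation is only suitable for small models.

```python
def bottom_sccs(delta):
    """Return the BSCCs of a chain given as {state: {successor: probability}}."""
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []

    def visit(s):
        index[s] = low[s] = len(index)
        stack.append(s)
        on_stack.add(s)
        for t, p in delta[s].items():
            if p <= 0:
                continue
            if t not in index:
                visit(t)
                low[s] = min(low[s], low[t])
            elif t in on_stack:
                low[s] = min(low[s], index[t])
        if low[s] == index[s]:  # s is the root of a component
            component = set()
            while True:
                t = stack.pop()
                on_stack.discard(t)
                component.add(t)
                if t == s:
                    break
            sccs.append(component)

    for s in delta:
        if s not in index:
            visit(s)
    # A component is bottom if no transition with positive probability leaves it.
    return [
        c for c in sccs
        if all(t in c for s in c for t, p in delta[s].items() if p > 0)
    ]


# Hypothetical chain: "s" is transient, {"p"} and {"q1", "q2"} are bottom.
delta = {
    "s": {"p": 0.5, "q1": 0.5},
    "p": {"p": 1.0},
    "q1": {"q2": 1.0},
    "q2": {"q1": 0.5, "q2": 0.5},
}
bsccs = bottom_sccs(delta)
print(sorted(sorted(c) for c in bsccs))
```

A transient state without a self-loop shows up as a singleton component here, which is not an SCC in the sense defined above. It is discarded by the final filter, since its outgoing probability mass must leave the component.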

The bottom components fully capture the limit behaviour of any Markov 
chain. Intuitively, the following statement says that (i) with probability one a 
run of a Markov chain eventually forever remains inside one single BSCC, and 
(ii) inside a BSCC, all states are visited infinitely often with probability one. 


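Lemma 1 ([4, Thm. 10.27]). For any MC $M$ and state $s$, we have 
$\mathsf{Pr}_{M,s}[\{\rho \mid \exists R_i \in \mathrm{BSCC}(M). \exists n_0 \in \mathbb{N}. \forall n > n_0. \rho(n) \in R_i\}] = 1$. 
For any BSCC $R \in \mathrm{BSCC}(M)$ and states $s, s' \in R$, we have $\mathsf{Pr}_{M,s}[\Diamond\{s'\}] = 1$. 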


Stationary Distribution Given a state $\hat{s}$, the stationary distribution (also known 
as steady-state or long-run distribution) of a Markov chain intuitively describes, 
for each state $s$, the probability for the system to be at this particular state at an 
arbitrarily chosen step “at infinity”. There are several ways to define this notion. 
In particular, there is a subtle difference between the limiting and stationary 
distribution, which however coincide for aperiodic MCs. For the sake of readability, 
we omit this distinction and assume w.l.o.g. that all MCs we deal with are 
aperiodic. See [26, App. A.1] for further discussion. Our definition follows the 
view of [4, Def. 10.79]; see [29, Sec. A.4] for a different approach. 

Fig. 1. Example MC to demonstrate the stationary distribution. We have that 
$\pi^\infty_{M,s} = \{p \mapsto \tfrac{1}{2}, s \mapsto 0, q_1 \mapsto \ldots, q_2 \mapsto \ldots\}$. 


Definition 1. Fix a Markov chain $M = (S, \delta)$ and initial state $\hat{s}$. Let 
$\pi^n_{M,\hat{s}}(s) = \mathsf{Pr}_{M,\hat{s}}[\Diamond^{=n}\{s\}]$ be the probability that the system is at state $s$ in step $n$. Then, 
$\pi^\infty_{M,\hat{s}}(s) = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \pi^i_{M,\hat{s}}(s)$ is the stationary distribution of $M$. 

See Fig. 1 for an example. Whenever the reference is clear from context, we omit 
the respective subscripts from $\pi^\infty_{M,\hat{s}}$. 
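
The averaging in Definition 1 can be tried out directly. The sketch below propagates the step-wise distributions and returns their average over the first `steps` steps. The function name `cesaro_average` and the two-state chain are assumptions for illustration. The chain is deliberately periodic: the step-wise distribution alternates forever, yet the average settles.

```python
def cesaro_average(delta, init, steps):
    """Average of the step-wise distributions pi^1, ..., pi^steps from `init`."""
    dist = {s: 1.0 if s == init else 0.0 for s in delta}  # pi^1: point mass on init
    total = {s: 0.0 for s in delta}
    for _ in range(steps):
        for s in delta:
            total[s] += dist[s]
        nxt = {s: 0.0 for s in delta}
        for s, succ in delta.items():
            for t, p in succ.items():
                nxt[t] += dist[s] * p
        dist = nxt
    return {s: total[s] / steps for s in delta}


# Hypothetical periodic chain that flips between "a" and "b" in every step.
flip = {"a": {"b": 1.0}, "b": {"a": 1.0}}
average = cesaro_average(flip, "a", steps=1000)
print(average)
```

For an even number of steps, both states receive exactly half of the accumulated mass, matching the stationary distribution of this chain.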

We briefly recall the classical approach to compute stationary distributions 
(see e.g. [19, Sec. 4.7]). By Lemma 1, almost all runs eventually end up in a BSCC. 
Thus, $\pi^\infty(s) = 0$ for all states $s$ not in a BSCC, or, dually, $\sum_{s \in B} \pi^\infty(s) = 1$ 
for $B = \bigcup_{R \in \mathrm{BSCC}(M)} R$. Moreover, once in a BSCC, we always obtain the 
same stationary distribution, irrespective of the state through which we entered the 
BSCC. Formally, for each BSCC $R \in \mathrm{BSCC}(M)$ and $s, s' \in R$, we have that 
$\pi^\infty_{M,s} = \pi^\infty_{M,s'} = \pi^\infty_{M|_R,s}$, i.e. each BSCC $R$ has a unique stationary distribution, 
which we denote by $\pi^\infty_R$. Note that $\mathrm{supp}(\pi^\infty_R) = R$, i.e. $\pi^\infty_R(s) \neq 0$ if and only if 
$s \in R$. Together, we observe that the stationary distribution of a Markov chain 
decomposes into (i) the steady state distribution in each BSCC and (ii) the 
probability to end up in a particular BSCC. More formally, for any state $s \in S$ 


$\pi^\infty_{M,\hat{s}}(s) = \sum_{R \in \mathrm{BSCC}(M)} \mathsf{Pr}_{M,\hat{s}}[\Diamond R] \cdot \pi^\infty_R(s)$. (3) 


Consider the example of Fig. 1: We have two BSCCs, $\{p\}$ and $\{q_1, q_2\}$, which 
are both reached with probability $\tfrac{1}{2}$. The overall distribution $\pi^\infty_{M,s}$ 
then is obtained from $\pi^\infty_{\{p\}} = \{p \mapsto 1\}$ and $\pi^\infty_{\{q_1,q_2\}} = \{q_1 \mapsto \ldots, q_2 \mapsto \ldots\}$. 

As mentioned, we can compute reachability probabilities in Markov chains by 
solving Eq. (1). Thus, the remaining concern is to compute $\pi^\infty_R$, i.e. the stationary 
distribution of $M|_R$. In this case, i.e. Markov chains comprising a single BSCC, 
the steady state distribution is the unique fixed point of the transition function 
(up to rescaling). By defining the row transition matrix of $M$ as $P_{i,j} = \delta(j, i)$, 
we can reformulate this property in terms of linear algebra. In particular, we 
have that $P \cdot \pi^\infty_R = \pi^\infty_R$, or, in other words, $(P - I) \cdot \pi^\infty_R = \vec{0}$, where $I$ is an 
appropriately sized identity matrix [29, Thm. A.2]. This equation again can be 
solved by classical methods from linear algebra. In summary, we (i) compute 
$\mathrm{BSCC}(M)$, (ii) for each BSCC $R$, compute $\pi^\infty_R$ and $\mathsf{Pr}_{M,\hat{s}}[\Diamond R]$, and (iii) combine 
according to Eq. (3). 
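
As a small worked instance of these three steps, consider a hypothetical BSCC $R = \{a, b\}$ with $\delta(a, b) = 1$ and $\delta(b, a) = \delta(b, b) = \tfrac{1}{2}$, and assume that $R$ is reached from the initial state with probability $\tfrac{1}{2}$. Both the chain and the reachability value are assumptions chosen for illustration. The fixed point property is written out per state as balance equations, so that the computation does not depend on the matrix convention.

```latex
% Hypothetical BSCC R = {a, b} with \delta(a,b) = 1 and \delta(b,a) = \delta(b,b) = 1/2.
\begin{align*}
  \pi^\infty_R(a) &= \tfrac{1}{2}\,\pi^\infty_R(b), \\
  \pi^\infty_R(b) &= \pi^\infty_R(a) + \tfrac{1}{2}\,\pi^\infty_R(b), \\
  \pi^\infty_R(a) + \pi^\infty_R(b) &= 1
  \quad\Longrightarrow\quad
  \pi^\infty_R(a) = \tfrac{1}{3},\ \pi^\infty_R(b) = \tfrac{2}{3}. \\
  \intertext{With $\mathsf{Pr}_{M,\hat{s}}[\Diamond R] = \tfrac{1}{2}$, Eq.~(3) yields}
  \pi^\infty_{M,\hat{s}}(a) &= \tfrac{1}{2}\cdot\tfrac{1}{3} = \tfrac{1}{6}, \qquad
  \pi^\infty_{M,\hat{s}}(b) = \tfrac{1}{2}\cdot\tfrac{2}{3} = \tfrac{1}{3}.
\end{align*}
```

The first two equations are linearly dependent, which is why the normalization constraint is needed to single out the distribution among all rescaled fixed points.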
However, as also mentioned in the introduction, precisely solving linear 
equation systems may not scale well, due to both time and memory 
constraints. Thus, we are also interested in relaxing the problem slightly and 
instead approximating the stationary distribution up to a given precision of $\varepsilon > 0$. 


Problem Statement Given a Markov chain $M$ and precision requirement 
$\varepsilon > 0$, compute bounds $l, u : S \to [0,1]$ such that (i) $\max_{s \in S} u(s) - l(s) \le \varepsilon$ 
and (ii) for all $s \in S$ we have $l(s) \le \pi^\infty_M(s) \le u(s)$. 


Approximate Solutions Aiming for approximations is not a new idea; to achieve 
practical performance, current model checkers employ approximate, iterative 
methods by default for most queries (typically a variant of value iteration). In 
particular, this also is the case for the stationary distribution: Instead of solving the 
equation system for each BSCC $R$ precisely, we can approximate the solution by, 
e.g., the power method. This essentially means to repeatedly apply the transition 
matrix (of the model restricted to the BSCC) to an initial vector $v_1$, i.e. iterating 
$v_{n+1} = P_R \cdot v_n$ (or $v_{n+1} = P_R^n \cdot v_1$). Similarly, the reachability probability for 
each BSCC then also is approximated by value iteration. 

It is known that (for aperiodic MCs) $\lim_{n \to \infty} v_n = \pi^\infty_R$ (see e.g. [31,16,27]), 
however convergence up to a precision of $\varepsilon$ may take exponential time in the 
worst case. Moreover, there is no known stopping criterion which allows us to 
detect that we have converged and stop the computation early. Yet, similar to 
reachability [7,14], current model checkers employ this method without a sound 
stopping criterion, leading to potentially arbitrarily wrong results, as we show in 
our evaluation (Fig. 2). See [16] for a related, in-depth discussion of these issues 
in the context of CTMC. 
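
To see how an absolute-change criterion can go wrong, consider the following Python sketch. It is our own two-state illustration and not the example of Fig. 2; the function name `power_method_abs_change`, the threshold, and the chain are assumptions. Here `P[i][j]` stores the probability of moving from state `i` to state `j`, and the iterate is treated as a row vector.

```python
def power_method_abs_change(P, v, threshold=1e-6, max_steps=10**6):
    """Power iteration v <- v P, stopped once no entry changes by `threshold`.

    The stopping rule is deliberately the unsound one: a small change between
    two iterates says nothing about the distance to the limit.
    """
    n = len(v)
    for step in range(1, max_steps + 1):
        w = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(w[j] - v[j]) for j in range(n)) < threshold:
            return w, step
        v = w
    return v, max_steps


# Hypothetical chain: both states are left only with a tiny probability a.
a = 1e-9
P = [[1 - a, a], [a, 1 - a]]
approx, steps = power_method_abs_change(P, [1.0, 0.0])
exact = [0.5, 0.5]  # by symmetry, both states have the same long-run share
print(approx, steps, exact)
```

The iteration stops after a single step and reports nearly all mass on the first state, although the chain is aperiodic and its stationary distribution is uniform. Lowering the threshold only shifts the problem, since the switching probability can be chosen smaller accordingly.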

We thus want to find efficient methods to derive safe bounds on the station- 
ary distribution of a BSCC with a correct stopping criterion and combine it 
with correct reachability approximations to obtain an overall fast and sound 
approximation. To this end, we exploit two further concepts. 


Partial Exploration Recent works [7,2,18,24] demonstrate the applicability of 
partial exploration to a variety of problems associated with probabilistic systems 
such as reachability. Essentially, the idea is to “omit” parts of the system which 
can be proven to be irrelevant for the result, instead focussing on important areas 
of the system. Of course, by omitting parts of the system, we may incur a small 
error. As such, these approaches naturally aim for approximate solutions. 


Mean payoff We make use of another property, namely mean payoff (also known 
as long-run average reward). We provide a brief overview and direct the reader to e.g. 
[29, Chp. 8 & 9] or [2] for more information. Mean payoff is specified by a 
Markov chain and a reward function $r : S \to \mathbb{R}$, assigning a reward to each state. 
Given an infinite path $\rho = s_1 s_2 \cdots$, this naturally induces a stream of rewards 
$r(\rho) = r(s_1) r(s_2) \cdots$. The mean payoff of this path then equals the average 
reward obtained in the limit, $\mathsf{mp}_r(\rho) := \liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} r(s_i)$. (The limit 
might not be defined for some paths, hence considering the lim inf is necessary.) 
Finally, the mean payoff of a state $s$ is the expected mean payoff according to 
$\mathsf{Pr}_{M,s}$, i.e. $\mathsf{mp}_r(s) := \mathbb{E}_{M,s}[\mathsf{mp}_r]$. 

Classically, mean payoff is computed by solving a linear equation system [29, 
Thm. 9.1.2]. Instead, we can also employ value iteration to approximate the 
mean payoff, however with a slight twist. We iteratively compute the expected 
total reward, i.e. the expected sum of rewards obtained after $n$ steps, by iterating 
$v_{n+1}(s) = r(s) + \delta(s)\langle v_n \rangle$. It turns out that the increase $\Delta_n(s) = v_{n+1}(s) - v_n(s)$ 
approximates the mean payoff, i.e. $\mathsf{mp}_r(s) = \lim_{n \to \infty} \Delta_n(s)$ [29, Thm. 9.4.5 
a)]. Moreover, we have $\min_{s' \in S} \Delta_n(s') \le \mathsf{mp}_r(s) \le \max_{s' \in S} \Delta_n(s')$, yielding a 
correct stopping criterion [29, Thm. 9.4.5 b)]. Finally, on BSCCs these upper and 
lower bounds always converge [29, Cor. 9.4.6 b)], yielding termination guarantees. 
We provide further details on VI for mean payoff in [26, App. A.3]. 
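
The following Python sketch illustrates this value iteration together with its stopping criterion on a chain that forms a single aperiodic BSCC. The function name `mean_payoff_vi`, the dictionary encoding, and the example chain and rewards are assumptions for illustration.

```python
def mean_payoff_vi(delta, reward, eps):
    """Iterate v_{n+1}(s) = r(s) + delta(s)<v_n> until the increases agree up to eps.

    Returns (lower, upper, iterations) with lower <= mean payoff <= upper.
    """
    v = {s: 0.0 for s in delta}
    iterations = 0
    while True:
        nxt = {
            s: reward[s] + sum(p * v[t] for t, p in delta[s].items())
            for s in delta
        }
        increase = [nxt[s] - v[s] for s in delta]  # the values Delta_n(s)
        v, iterations = nxt, iterations + 1
        if max(increase) - min(increase) < eps:
            return min(increase), max(increase), iterations


# Hypothetical single-BSCC chain; the reward counts visits to "a".
delta = {"a": {"b": 1.0}, "b": {"a": 0.5, "b": 0.5}}
reward = {"a": 1.0, "b": 0.0}
lower, upper, iterations = mean_payoff_vi(delta, reward, eps=1e-6)
print(lower, upper, iterations)
```

For this chain the long-run frequency of "a" is 1/3, and the returned interval encloses it. In contrast to an absolute-change criterion on the iterates, the gap between the smallest and largest increase is a certified error bound.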


3 Building Blocks 


To arrive at a practical algorithm approximating the stationary distribution, we 
propose to employ sampling-based techniques, inspired by, e.g. [7,2,18]. Intuitively, 
these approaches repeatedly sample paths and compute bounds on a single 
property such as reachability or mean payoff. The sampling is designed to follow 
probable paths with high probability, hence the computation automatically 
focuses on the most relevant parts of the system. Additionally, by building the 
system on the fly, construction of hardly reachable parts of the system may be 
avoided altogether, yielding immense speed-ups for some models (see, e.g., [18] for 
additional background). We apply a series of tweaks to the original idea to tailor 
this approach to our use case, i.e. approximating the stationary distribution. 

In this section, we present the “building blocks” for our approximate approach. 
In the spirit of Eq. (3), we discuss how we handle a single BSCC and how to 
approximate the reachability probabilities of all BSCCs. In the following section, 
we then combine these two approaches in a non-trivial manner. 


3.1 Bounds in BSCCs through Mean Payoff 


It is well known that the mean payoff can be computed directly from the stationary 
distribution [29, Prop. 8.1.1], namely: 


$\mathsf{mp}_r(s) = \sum_{s' \in S} \pi^\infty_{M,s}(s') \cdot r(s')$ (4) 


In this section, we propose the opposite, namely computing the stationary 
distribution of a BSCC through mean payoff queries. Fix a Markov chain 
$M = (S, \delta)$ which comprises a single BSCC, i.e. $S \in \mathrm{BSCC}(M)$, and define 
$r(s') = \mathbb{1}_{\{s\}}(s')$, i.e. 1 for $s$ and 0 otherwise. Then, the mean payoff corresponds to the 
frequency of $s$ appearing, i.e. the stationary distribution. Formally, we have that 
$\pi^\infty_M(s) = \mathsf{mp}_r(s')$ for any state $s'$ (in a BSCC, all states have the same value). 
This also follows directly by inserting in Eq. (4). So, naively, for each state of 
the BSCC, we can solve a mean payoff query, and from these results obtain the 
overall stationary distribution. 
Algorithm 1 Approximate Stationary Distribution in BSCC 
Input: Markov chain $M = (S, \delta)$ with $\mathrm{BSCC}(M) = \{S\}$ 
Output: Bounds $l, u$ on stationary distribution $\pi^\infty_M$. 
1: $n \gets 1$ 
2: for $s \in S$ do $l_1(s) \gets 0$, $u_1(s) \gets 1$ 
3: for $s \in S$ do 
4:     $m \gets 1$, $v_1 \gets$ INITGUESS($s$) 
5:     while not SHOULDSTOP($s, m, \Delta_m$) do ▷ Iterate until some stopping criterion 
6:         for $s' \in S$ do $v_{m+1}(s') \gets \mathbb{1}_{\{s\}}(s') + \delta(s')\langle v_m \rangle$ ▷ Mean payoff VI for $s$ 
7:         $m \gets m + 1$ 
8:     $l'_n(s) \gets \max(l_n(s), \min_{s' \in S} \Delta_m(s'))$, $u'_n(s) \gets \min(u_n(s), \max_{s' \in S} \Delta_m(s'))$ 
9:     for $s' \in S \setminus \{s\}$ do $l'_n(s') \gets l_n(s')$, $u'_n(s') \gets u_n(s')$ 
10:    for $s' \in S$ do ▷ Update bounds based on current results (optional) 
11:        $l_{n+1}(s') \gets \max(l'_n(s'), 1 - \sum_{s'' \in S, s'' \neq s'} u'_n(s''))$ 
12:        $u_{n+1}(s') \gets \min(u'_n(s'), 1 - \sum_{s'' \in S, s'' \neq s'} l'_n(s''))$ 
13:    $n \gets n + 1$ and copy all unchanged values from $n$ to $n + 1$ 
14: return $(l_n, u_n)$ 


At first, this may seem excessive, especially considering that computing the 
complete stationary distribution is as hard as determining the mean payoff for 
one state (both can be obtained by solving a linearly sized equation system). 
However, this idea yields some interesting benefits. Firstly, using the approxi- 
mation approach discussed in Section 2, we obtain a practical approximation 
scheme with converging bounds for each state. As such, we can quickly stop the 
computation if the bounds converge fast. Moreover, we can pause and restart the 
computation for each state, which we will use later on in order to focus on crucial 
states. Finally, observe that $\pi^\infty_M$ is a distribution. Thus, having lower bounds on 
some states actually already yields upper bounds for the remaining states. Formally, 
for some lower bound $l : S \to [0,1]$, we have $\pi^\infty_M(s) \le 1 - \sum_{s' \in S, s' \neq s} l(s')$. If 
during our computation it turns out that a few states are actually visited very 
frequently, i.e. the sum of their lower bounds is close to 1, we can already stop 
the computation without ever investigating the other states. Note that this only 
is possible since we obtain provably correct bounds. 

Combining these ideas, we present our first algorithm template in Algorithm 1. 
We solve each state separately, by applying the classical value iteration approach 
for mean payoff until a termination criterion is satisfied. To allow for modifica- 
tions, we leave the definition of several sub-procedures open. Firstly, INITGUESS 
initializes the value vector for each mean payoff computation. We can naively 
choose 0 everywhere, obtain an initial guess by heuristics, or re-use previously 
computed values. Secondly, SHOULDSTOP decides when to stop the iteration for 
each state. A simple choice is to iterate until $\max_{s' \in S} \Delta_m(s') - \min_{s' \in S} \Delta_m(s') < \varepsilon$ for 
some precision requirement $\varepsilon$. By results on mean payoff, we can conclude that in 
this case the stationary distribution is computed with a precision of $\varepsilon$. However, 
as we argue later on, more sophisticated choices are possible. Finally, the order 
in which states are chosen is not fixed. Indeed, any order yields correct results, 
however heuristically re-ordering the states may also bring practical benefits. 
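
One possible instantiation of this template is sketched below in Python. It is a minimal sketch under simplifying assumptions: INITGUESS is the zero vector, SHOULDSTOP is the simple gap criterion, states are processed in dictionary order, and the optional update is applied after every state. The names `mean_payoff_bounds` and `stationary_bounds` as well as the example chain are our own choices for illustration.

```python
def mean_payoff_bounds(delta, target, eps):
    """Mean payoff VI for the indicator reward of `target` (naive SHOULDSTOP)."""
    v = {s: 0.0 for s in delta}  # INITGUESS: zero everywhere
    while True:
        nxt = {
            s: (1.0 if s == target else 0.0)
            + sum(p * v[t] for t, p in delta[s].items())
            for s in delta
        }
        increase = [nxt[s] - v[s] for s in delta]
        v = nxt
        if max(increase) - min(increase) < eps:
            return min(increase), max(increase)


def stationary_bounds(delta, eps):
    """Bounds (lower, upper) on the stationary distribution of a single-BSCC chain."""
    lower = {s: 0.0 for s in delta}
    upper = {s: 1.0 for s in delta}
    for s in delta:  # any processing order is sound
        lo, hi = mean_payoff_bounds(delta, s, eps)
        lower[s], upper[s] = max(lower[s], lo), min(upper[s], hi)
        # Optional update: all values sum to one, so bounds on some states
        # restrict the remaining ones.
        sum_lower, sum_upper = sum(lower.values()), sum(upper.values())
        new_lower = {t: max(lower[t], 1.0 - (sum_upper - upper[t])) for t in delta}
        new_upper = {t: min(upper[t], 1.0 - (sum_lower - lower[t])) for t in delta}
        lower, upper = new_lower, new_upper
    return lower, upper


# Hypothetical single-BSCC chain in which "a" carries almost all of the mass.
delta = {
    "a": {"a": 0.98, "b": 0.01, "c": 0.01},
    "b": {"a": 1.0},
    "c": {"a": 1.0},
}
lower, upper = stationary_bounds(delta, eps=1e-6)
print(lower, upper)
```

In this example, the lower bound of about 0.98 obtained for "a" already caps the upper bounds of "b" and "c" at roughly 0.02 before either of them is investigated. A more refined SHOULDSTOP could exploit this and skip states whose bounds are already tight enough.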

Before we continue, we briefly argue that the algorithm is correct. 
Theorem 1. The result returned by Algorithm 1 is correct for any MC $M = (S, \delta)$ 
with $\mathrm{BSCC}(M) = \{S\}$. 


Proof (Sketch). Correctness of the mean payoff iteration follows from the defini- 
tion of the reward function, Eq. (4), and the correctness of value iteration for 
mean payoff [29, Sec. 8.5]. In particular, note that the states of the MC form a 
single BSCC and the model is unichain (see [29, Chp. A]), implying that all states 
have the same value. For l and u, we prove correctness inductively. The initial 
values are trivially correct. The updates based on the mean payoff computation 
are correct by the above arguments and by induction hypothesis: The maximum 
of two correct lower bounds still is a lower bound, analogous for the upper bound. 
The updates based on the bounds are correct since $\pi^\infty_M$ is a distribution and $l'$, 
$u'$ are correct bounds. 


We deliberately omit introducing an explicit precision requirement in the algo- 
rithm, since we will use it as a building block later on. 


Remark 1. A variant of this approach also allows for memory savings: By handling 
one state at a time, we only need to store linearly many additional values (in the 
number of states) at any time, while an explicit equation system may require 
quadratic space. This only yields a constant factor improvement if the system 
is represented explicitly (storing $\delta$ requires as much space), but it can be of 
significant merit for symbolically encoded systems. Note that this comes at a 
cost: As we cannot stop and resume the computation for different states, we have 
to determine the correct result up to the required precision immediately. 


3.2 Reachability and Guided Sampling 


As mentioned before, the second challenge to obtain a stationary distribution 
is the reachability probability for each BSCC. We employ a sampling-based approach 
using insights from [7]. There, the authors considered a single reachability 
objective, i.e. a single value per state. In contrast, we need to bound reachability 
probabilities for each BSCC. For now, suppose that all BSCCs are already 
discovered and their respective stationary distribution is already computed (or 
approximated). In other words, we have for each BSCC $R \in \mathrm{BSCC}(M)$ bounds 
$l_R, u_R : R \to [0,1]$ with $l_R(s) \le \pi^\infty_R(s) \le u_R(s)$, and we want to obtain bounds 
on the stationary distribution, i.e. functions $l, u$ such that $l(s) \le \pi^\infty_{M,\hat{s}}(s) \le u(s)$. 
We propose to additionally compute bounds on the probability to reach each 
BSCC $R$, i.e. functions $l^{\Diamond R}$ and $u^{\Diamond R}$ such that $l^{\Diamond R}(s) \le \mathsf{Pr}_{M,s}[\Diamond R] \le u^{\Diamond R}(s)$. By 
Eq. (3), we then have for each state $s$ a bound on the stationary distribution 
(with $l_R(s) = u_R(s) = 0$ for $s \notin R$) 


$\sum_{R \in \mathrm{BSCC}(M)} l^{\Diamond R}(\hat{s}) \cdot l_R(s) \le \pi^\infty_{M,\hat{s}}(s) \le \sum_{R \in \mathrm{BSCC}(M)} u^{\Diamond R}(\hat{s}) \cdot u_R(s)$. 
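
The combination of the two kinds of bounds is a per-state product, since BSCCs are disjoint and the sums therefore collapse to a single term for every state inside a BSCC. The Python sketch below spells this out; the function name `combine_bounds`, the data layout, and all numbers are hypothetical and only serve as an illustration.

```python
def combine_bounds(reach_lower, reach_upper, bscc_lower, bscc_upper):
    """Bounds on the overall stationary distribution for all states inside BSCCs.

    reach_*: {BSCC name: bound on the probability to reach it from the initial state}
    bscc_*:  {BSCC name: {state: bound on the stationary distribution within it}}
    States outside of every BSCC have the value 0 and are omitted.
    """
    lower, upper = {}, {}
    for name in bscc_lower:
        for s in bscc_lower[name]:
            lower[s] = reach_lower[name] * bscc_lower[name][s]
            upper[s] = reach_upper[name] * bscc_upper[name][s]
    return lower, upper


# Hypothetical intermediate results for two BSCCs named "R1" and "R2".
reach_lower = {"R1": 0.49, "R2": 0.49}
reach_upper = {"R1": 0.51, "R2": 0.51}
bscc_lower = {"R1": {"x": 1.0}, "R2": {"y": 0.33, "z": 0.66}}
bscc_upper = {"R1": {"x": 1.0}, "R2": {"y": 0.34, "z": 0.67}}
lower, upper = combine_bounds(reach_lower, reach_upper, bscc_lower, bscc_upper)
print(lower, upper)
```

The width of the resulting interval for a state depends on both sources of imprecision, which is why it can pay off to refine reachability for likely BSCCs and the inner distribution for frequently visited states first.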


We take a route similar to [7]. There, the algorithm essentially samples a 
path through the system, possibly guided by a heuristic, terminates the sampling 
based on several criteria, and then propagates the reachability value backwards 
along the path, repeating until termination. We propose a simple modification, 
namely to sample until a BSCC is reached, and then propagate the reachability 
values of that particular BSCC back along the path. Moreover, we can employ a 
similar trick as above: Due to Lemma 1, the reachability probabilities of BSCCs 
sum up to one, i.e. $\sum_{R \in \mathrm{BSCC}(M)} \mathsf{Pr}_{M,s}[\Diamond R] = 1$ for every state $s$. Hence, the sum 
of lower bounds also yields upper bounds for other BSCCs, even those we have 
never encountered so far. 

Algorithm 2 Approximate BSCC Reachability 
Input: Markov chain $M = (S, \delta)$ 
Output: For each BSCC $R$ bounds $l^{\Diamond R}, u^{\Diamond R}$ on the probability to reach $R$. 
1: $B \gets \bigcup_{R \in \mathrm{BSCC}(M)} R$, $n \gets 1$ 
2: for $R \in \mathrm{BSCC}(M)$ do 
3:     for $s \in R$ do $l^{\Diamond R}_1(s) \gets 1$, $u^{\Diamond R}_1(s) \gets 1$ 
4:     for $s \in B \setminus R$ do $l^{\Diamond R}_1(s) \gets 0$, $u^{\Diamond R}_1(s) \gets 0$ 
5:     for $s \in S \setminus B$ do $l^{\Diamond R}_1(s) \gets 0$, $u^{\Diamond R}_1(s) \gets 1$ 
6: while SHOULDSAMPLE do ▷ Sample until some stopping criterion 
7:     $P \gets$ SAMPLESTATES ▷ Select states to update (e.g. sample a path) 
8:     for $R \in$ SELECTUPDATE($P$) do ▷ Select BSCCs to update 
9:         for $s \in P$ do 
10:            $l^{\Diamond R}_{n+1}(s) \gets \delta(s)\langle l^{\Diamond R}_n \rangle$ 
11:            $u^{\Diamond R}_{n+1}(s) \gets \delta(s)\langle u^{\Diamond R}_n \rangle$ 
12:    for $s \in S$ do ▷ Update bounds based on current results (optional) 
13:        for $R \in \mathrm{BSCC}(M)$ do 
14:            $l^{\Diamond R}_{n+1}(s) \gets \max(l^{\Diamond R}_n(s), 1 - \sum_{R' \in \mathrm{BSCC}(M), R' \neq R} u^{\Diamond R'}_n(s))$ 
15:            $u^{\Diamond R}_{n+1}(s) \gets \min(u^{\Diamond R}_n(s), 1 - \sum_{R' \in \mathrm{BSCC}(M), R' \neq R} l^{\Diamond R'}_n(s))$ 
16:    $n \gets n + 1$ and copy unchanged values from $l^{\Diamond R}_n$ and $u^{\Diamond R}_n$ to $l^{\Diamond R}_{n+1}$ and $u^{\Diamond R}_{n+1}$ 
17: return $\{(l^{\Diamond R}_n, u^{\Diamond R}_n) \mid R \in \mathrm{BSCC}(M)\}$ 

Our ideas are summarized in Algorithm 2. As before, the algorithm leaves 
several choices open. Instead of requiring to sample a path, our algorithm allows 
to select an arbitrary set of states to update. We note that the exact choice of 
this sampling mechanism does not improve the worst case runtime. However, as 
first observed in [7], specially crafted guidance heuristics can achieve dramatic 
practical speed-ups on several models. Later on, we combine our two algorithms 
and derive such a heuristic. For now, we briefly prove correctness. 
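
A minimal Python sketch of one way to instantiate this template is given below. It assumes that the BSCCs are already known, samples paths according to the transition probabilities (without any guidance heuristic), updates all BSCCs on every sampled path, and applies updates in place. The optional sum-based update is omitted for brevity. The function name `bscc_reach_bounds` and the example chain are assumptions.

```python
import random


def bscc_reach_bounds(delta, bsccs, init, rounds, rng):
    """Lower and upper bounds on the probability to reach each BSCC.

    Returns two lists, indexed like `bsccs`, of {state: bound} dictionaries.
    """
    in_bscc = set().union(*bsccs)
    lower, upper = [], []
    for R in bsccs:
        lower.append({s: 1.0 if s in R else 0.0 for s in delta})
        upper.append({s: 0.0 if s in in_bscc and s not in R else 1.0 for s in delta})
    for _ in range(rounds):
        # Sample a path from the initial state until some BSCC is entered.
        path, s = [], init
        while s not in in_bscc:
            path.append(s)
            successors = list(delta[s].items())
            s = rng.choices(
                [t for t, _ in successors], weights=[p for _, p in successors]
            )[0]
        # Propagate the bounds backwards along the sampled path.
        for t in reversed(path):
            for k in range(len(bsccs)):
                lower[k][t] = sum(p * lower[k][x] for x, p in delta[t].items())
                upper[k][t] = sum(p * upper[k][x] for x, p in delta[t].items())
    return lower, upper


# Hypothetical chain: "s" loops or moves to one of the BSCCs {"p"} and {"q1", "q2"}.
delta = {
    "s": {"s": 0.5, "p": 0.25, "q1": 0.25},
    "p": {"p": 1.0},
    "q1": {"q2": 1.0},
    "q2": {"q1": 0.5, "q2": 0.5},
}
bsccs = [{"p"}, {"q1", "q2"}]
lower, upper = bscc_reach_bounds(delta, bsccs, "s", rounds=40, rng=random.Random(0))
print(lower[0]["s"], upper[0]["s"])
```

In this example every sampled path contains "s", so each round at least halves the gap between its bounds, which enclose the exact value 0.5 for both BSCCs. The bounds stay correct for any choice of sampled states; the sampling strategy only affects how quickly they tighten.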


Theorem 2. The result returned by Algorithm 2 is correct for any MC $M = (S, \delta)$ 
with $\mathrm{BSCC}(M) = \{S\}$. 


Proof (Sketch). Similar to the previous algorithm, we prove correctness by induction. 
The initial values for $l^{\Diamond R}$ and $u^{\Diamond R}$ are correct. Then, assume that $l^{\Diamond R}_n$ and 
$u^{\Diamond R}_n$ are correct bounds. The correctness of the back propagation updates follows 
directly by inserting in Eq. (1) (or other works on interval value iteration [7,14]). 
Updates based on the bounds in other states are correct by Lemma 1 — the sum 
of all BSCC reachability probabilities is 1. Together, this yields correctness of 
the bounds computed by the algorithm. 


To obtain termination, it is sufficient to require that every state eventually is 
selected “arbitrarily often” by SAMPLESTATES. However, as before, we delegate 
the termination proof to our combined algorithm in the following section. 
4 Dynamic Computation with Partial Exploration 


Recall that our overarching goal is to approximate the stationary distribution 
through Eq. (3). In the previous section, we have seen how we can (i) obtain 
approximations for a given BSCC and (ii) how to approximate the reachability 
probabilities of all BSCCs through sampling. However, the naive combination of 
these algorithms would require us to compute the set of all BSCCs, approximate 
the stationary distribution in each of them until a fixed precision, and additionally 
approximate reachability for each of them. 

We now combine both ideas to obtain a sampling-based algorithm, capable of 
partial exploration, that focusses computation on relevant parts of the system. 
In particular, we construct the system dynamically, identify BSCCs on the fly, 
and interleave the exploration with both the approximation inside each explored 
BSCC (Algorithm 1) and the overall reachability computation (Algorithm 2). 
Moreover, we focus computation on BSCCs which are likely to be reached and 
thus have a higher impact on the overall error of the result. Together, our approach 
roughly performs the following steps until the required precision is achieved: 


— Sample a path through the system, guided by a heuristic, 

— check if a new BSCC is discovered or sampling ended in a known BSCC, 

— refine bounds on the stationary distribution in the reached BSCC, and 

— propagate reachability bounds and additional information along the path. 


We first formalize a generic framework which can instantiate the classical, precise 
approach as well as our approximation building blocks and then explain our 
concrete variant of this framework to efficiently obtain e-precise bounds. 


4.1 The Framework 


Since our goal is to allow for both precise as well as approximate solutions, we 
phrase the framework using lower and upper bounds together with abstract 
refinement procedures. We first explain our algorithm and how it generalizes the 
classical approach. Then, we prove its correctness under general assumptions. 
Finally, we discuss several approximate variants. 

Algorithm 3 essentially repeats three steps until the termination condition in
Line 4 is satisfied. First, we update the set of known BSCCs through
UPDATEBSSCs. In the classical solution, this function simply computes BSCC(M) once;
our on-the-fly construction would repeatedly check for newly discovered BSCCs,
dynamically growing the set $B_n$. Then, we select BSCCs for which we should
update the stationary distribution bounds. The classical solution solves the fixed
point equation we have discussed in Section 2 for all BSCCs, i.e.
SELECTDISTRIBUTIONUPDATES yields BSCC(M) and REFINEDISTRIBUTION the precisely
computed values both as upper and lower bounds. Alternatively, we could, for
example, select a single BSCC and apply a few iterations of Algorithm 1. Next,
we update reachability bounds for a selected set of BSCCs. Again, the classical
solution solves the reachability problem precisely for each BSCC through Eq. (1).
Instead, we could employ value iteration as suggested by Algorithm 2.
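Algorithm 3 Stationary Distribution Computation Framework

Input: Markov chain $M = (S, \delta)$, initial state $\hat{s}$, precision $\varepsilon > 0$

Output: $\varepsilon$-precise bounds $l, u$ on the stationary distribution $\pi^{\infty}_{M,\hat{s}}$

 1: for $s \in S$ do    ▷ Initial bounds for all possible BSCCs that can be discovered
 2:     $l_1^{\Diamond R}(s) \gets 0$, $u_1^{\Diamond R}(s) \gets 1$, $l_1^{R}(s) \gets 0$, $u_1^{R}(s) \gets 1$
 3: $n \gets 1$, $B_1 \gets \emptyset$
 4: while $\bigl(1 - \sum_{R \in B_n} l_n^{\Diamond R}(\hat{s})\bigr) + \sum_{R \in B_n} \bigl(l_n^{\Diamond R}(\hat{s}) \cdot \max_{s \in R} (u_n^{R}(s) - l_n^{R}(s))\bigr) > \varepsilon$ do
 5:     $n \gets n + 1$
 6:     $B_n \gets$ UPDATEBSSCs, $\overline{B}_n \gets \bigcup_{R \in B_n} R$    ▷ Discover new BSCCs
 7:     for $R \in B_n \setminus B_{n-1}$, $s \in R$ do    ▷ Update trivial reach bounds
 8:         $l_n^{\Diamond R}(s) \gets 1$    ▷ $s \in R$ surely reaches $R$
 9:         for $R' \neq R$ do $u_n^{\Diamond R'}(s) \gets 0$    ▷ $s \in R$ reaches no other BSCC
10:     for $R \in$ SELECTDISTRIBUTIONUPDATES($B_n$) $\cap B_n$ do
11:         $(l_n^{R}, u_n^{R}) \gets$ REFINEDISTRIBUTION($R$)    ▷ Update BSCC bounds
12:     for $R \in$ SELECTREACHUPDATES($B_n$) $\cap B_n$ do
13:         $(l_n^{\Diamond R}, u_n^{\Diamond R}) \gets$ REFINEREACH($R$)    ▷ Update reachability bounds
14:     Copy unchanged variables from $n - 1$ to $n$
15: $L \gets \sum_{R \in B_n} l_n^{\Diamond R}(\hat{s})$
16: for $R \in B_n$, $s \in R$ do
17:     $l(s) \gets l_n^{\Diamond R}(\hat{s}) \cdot l_n^{R}(s)$
18:     $u(s) \gets \min\bigl(u_n^{\Diamond R}(\hat{s}),\, 1 - L + l_n^{\Diamond R}(\hat{s})\bigr) \cdot u_n^{R}(s)$
19: for $s \in S \setminus \overline{B}_n$ do $l(s) \gets 0$, $u(s) \gets 0$
20: return $(l, u)$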


Before we present our variant, we prove correctness under weak assumptions. 
We note a subtlety of the termination condition: One may assume that upper 
bounds on the reachability are required to bound the overall error caused by each 
BSCC. Yet, as we show in the following theorem, lower bounds are sufficient. The 
upper bound is implicitly handled by the first part of the termination condition. 
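
As an illustration of the termination condition and of how the final bounds are assembled at the end of Algorithm 3, consider the following minimal Python sketch. The dictionary layout and the names `remaining_error` and `combine_bounds` are hypothetical choices made for this example and are not taken from the implementation; BSCCs are identified by arbitrary labels, and all reachability bounds refer to the initial state.

```python
def remaining_error(reach_lower, dist_lower, dist_upper):
    """Left-hand side of the termination condition of Algorithm 3.

    reach_lower[r]   : lower bound on reaching BSCC r from the initial state
    dist_lower[r][s] : lower bound on the stationary distribution of s inside r
    dist_upper[r][s] : matching upper bound
    """
    # Probability mass not yet attributed to any discovered BSCC.
    unexplored = 1.0 - sum(reach_lower.values())
    # Error inside each BSCC, weighted by the lower reachability bound only.
    weighted = sum(
        reach_lower[r]
        * max(dist_upper[r][s] - dist_lower[r][s] for s in dist_lower[r])
        for r in reach_lower
    )
    return unexplored + weighted


def combine_bounds(reach_lower, reach_upper, dist_lower, dist_upper):
    """Assemble the final bounds on the stationary distribution."""
    total_lower = sum(reach_lower.values())
    lower, upper = {}, {}
    for r in reach_lower:
        # The lower bounds of all other BSCCs imply an upper bound for r.
        reach_up = min(reach_upper[r], 1.0 - total_lower + reach_lower[r])
        for s in dist_lower[r]:
            lower[s] = reach_lower[r] * dist_lower[r][s]
            upper[s] = reach_up * dist_upper[r][s]
    return lower, upper
```

Note that `remaining_error` never reads the upper reachability bounds, which mirrors the observation that lower bounds suffice for the termination condition.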


Theorem 3. The result returned by Algorithm 3 is correct, i.e. $\varepsilon$-precise bounds
on the stationary distribution, if (i) $B_n \subseteq B_{n+1} \subseteq \mathrm{BSCC}(M)$ for all $n$, and
(ii) REFINEDISTRIBUTION and REFINEREACH yield correct, monotone bounds.


The proof can be found in [26, App. B.1]. 


Remark 2. Technically, the algorithm does not need to track explicit upper
bounds on the reachability of each BSCC at all. Indeed, for a BSCC $R \in B_n$, we
could use $1 - \sum_{R' \in \mathrm{BSCC}(M) \setminus \{R\}} l_n^{\Diamond R'}(\hat{s})$ as upper bound and still obtain a correct
algorithm. However, tracking a separate upper bound is easier to understand and
has some practical benefits for the implementation.


We exclude a proof of termination, since this strongly depends on the interplay 
between the functions left open. We provide a general, technical criterion to- 
gether with a proof in [26, App. B.2]. Intuitively, as one might expect, we require 
that eventually UPDATEBSSCs identifies all relevant BSCCs, SELECTDISTRI- 
BUTIONUPDATES and SELECTREACHUPDATES select all relevant BSCCs, and 
REFINEDISTRIBUTION and REFINEREACH converge to the respective true value. 
In the following, we present a concrete template which satisfies this criterion. 
4.2 Sampling-Based Computation


We present our instantiation of Algorithm 3 using guided sampling and heuristics. 
Since the details of the sampling guidance heuristic are rather technical, we focus 
on how the template functions UPDATEBSSCs, SELECTDISTRIBUTIONUPDATES,
REFINEDISTRIBUTION, SELECTREACHUPDATES, and REFINEREACH are instantiated.
For now, the reader may assume that states are, e.g., selected by sampling
random paths through the system. 


— UPDATEBSSCs: We track the set of explored states, i.e. states which have 
already been sampled at least once. On these, we search for BSCCs whenever 
we repeatedly stop sampling due to a state re-appearing. 

— SELECTDISTRIBUTIONUPDATES: If we stopped sampling due to entering a 
known BSCC, we update the bounds of this single one, otherwise none. 

— REFINEDISTRIBUTION: We employ Algorithm 1 to refine the bounds until 
the error over all states is halved. 

— SELECTREACHUPDATES: We refine the reach values for all sampled states. 

— REFINEREACH: If we stopped sampling due to entering a BSCC, we
back-propagate the reachability bounds for this BSCC in the spirit of Algorithm 2,
i.e. for all sampled states set $l_n^{\Diamond R}(s) = \delta(s)\langle l_{n-1}^{\Diamond R} \rangle$ and $u_n^{\Diamond R}(s) = \delta(s)\langle u_{n-1}^{\Diamond R} \rangle$.
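
The back-propagation step of REFINEREACH can be sketched in Python as follows. This is only an illustration under assumptions: we take the update of a state to be the expectation of the current bounds over its successors, we update in place while walking the path backwards (a variation that lets fresh values flow towards the start of the path), and the name `backpropagate` as well as the dictionary layout are hypothetical.

```python
def backpropagate(path, delta, lower, upper):
    """Update reachability bounds for one BSCC along a sampled path.

    path  : sampled states outside the BSCC, in visiting order
    delta : delta[s] maps each successor of s to its transition probability
    lower, upper : current reachability bounds; states without an entry
                   default to the trivial bounds 0 and 1
    """
    for s in reversed(path):
        lower[s] = sum(p * lower.get(t, 0.0) for t, p in delta[s].items())
        upper[s] = sum(p * upper.get(t, 1.0) for t, p in delta[s].items())
    return lower, upper
```

For instance, if `s1` enters the BSCC with probability 0.5 and otherwise returns to `s0`, a single backwards pass already raises the lower bound of `s1` to 0.5 and that of its predecessor accordingly.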


We prove that this yields correct results and terminates with probability 1 through 
Theorem 3. Note that this description leaves exact details of the sampling open. 
Thus, we prove termination using (weak) conditions on the sampling mechanism. 
For readability, we define the shorthand $\mathrm{err}_n^{R} = \max_{s \in R} u_n^{R}(s) - l_n^{R}(s)$ denoting
the overall error of the stationary distribution in BSCC $R$ and $\mathrm{err}_n^{\Diamond R}(s) =
u_n^{\Diamond R}(s) - l_n^{\Diamond R}(s)$ the error bound on the reachability of $R$ from $s$.


Theorem 4. Algorithm 3 instantiated with our sampling-based approach yields
correct results and terminates with probability 1 if, with probability 1,


(S.i) the sampled states $P \subseteq S$ satisfy $\Pr_{M,\hat{s}}[\Diamond \overline{P}]$ < £ ($P$ is a §-core [18]),

(S.ii) the initial state is sampled arbitrarily often, and

(S.iii) for each state $s$ sampled arbitrarily often, every successor $s' \in P$ with
$E_n(s') = \max_{R \in B_n} u_n^{\Diamond R}(s') \cdot \mathrm{err}_n^{R} + \max_{R \in B_n} \mathrm{err}_n^{\Diamond R}(s')$ > apiy is
sampled arbitrarily often,


where “arbitrarily often” means that if the algorithm would not terminate, this 
would happen infinitely often. 


The proof can be found in [26, App. B.3].

Due to space constraints, we omit an in-depth description of our sampling
method and only provide a brief summary here. Our algorithm
first selects a “sampling target” which is either “the unknown”, i.e. states not
seen so far, to encourage exploration in the style of [18], or a known BSCC, to 
bias sampling towards it. We select a choice randomly, weighted by its current 
potential influence on the precision. The sampling process is guided by the 
chosen target, taking actions which lead to the respective target with high 
probability. In technical terms, we sample successors weighted by the upper 
bound on reachability probability times the transition probability. Once the
target is reached, we either explore the unknown, or improve precision in the 
reached BSCC. Finally, information is back-propagated along the path. Further 
details, in particular pitfalls we encountered during the design process, together 
with a complete instantiation of our algorithm can be found in [26, App. C]. 
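
The weighting of successors described above can be illustrated by a short Python sketch. It is not the instantiation from [26, App. C]; the function names, the dictionary layout, and the choice to treat unseen states as having upper bound 1 are assumptions made for this example.

```python
import random


def successor_weights(state, delta, reach_upper):
    """Weight each successor by transition probability times the upper
    bound on reaching the chosen target from that successor."""
    return {t: p * reach_upper.get(t, 1.0) for t, p in delta[state].items()}


def sample_successor(state, delta, reach_upper, rng=random):
    """Draw the next state of a guided sample, or None if no successor
    can still reach the target according to the current upper bounds."""
    weights = successor_weights(state, delta, reach_upper)
    candidates = [t for t, w in weights.items() if w > 0.0]
    if not candidates:
        return None
    return rng.choices(candidates, weights=[weights[t] for t in candidates], k=1)[0]
```

Successors whose upper bound has dropped to zero are never selected, so sampling effort is not spent on parts of the system that provably cannot reach the target.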


5 Experimental Evaluation 


In this section, we evaluate our approaches, comparing to both our own reference 
implementation using classical methods, as well as the established model checker 
PRISM [21]. (The other popular model checkers Storm [10] and IscasMC/ePMC 
[15] do not directly support computing stationary distributions.) We implemented 
our methods in Java based on PET [24], running on consumer hardware (AMD 
Ryzen 5 3600). To solve arising linear equation systems, we use Jeigen v1.2. 
All executions are performed in a Docker container, restricted to a single CPU 
core and 8GB of RAM. For approximations, we require a precision of $\varepsilon = 10^{-4}$.


Tools Aside from PRISM¹, we consider three variants of Algorithm 3, namely
Classic, the classical approach, solving each BSCC through a linear equation
system and then approximating the reachability through PRISM (using interval
iteration), Naive, the naive sampling approach, following the transition dynamics,
and Sample, our sampling approach, selecting a target and steering towards it.
The source code of our implementation used to run these experiments as well as
all models and our data are available at [25]. Moreover, the current version can be
found on GitHub [23].

We mention two points relevant for the comparison. First, as we show in the 
following, PRISM may yield wrong results due to a (too) simple computation. As 
such, we should not expect that our correct methods are on par or even faster. 
Second, our implementation employs conservative procedures to further increase 
quality of the result, such as compensated summation to mitigate numerical error 
due to floating-point imprecision, noticeably increasing computational effort. 
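
To illustrate what compensated summation buys, the following Python sketch shows Neumaier's variant of Kahan summation. Which variant the implementation actually uses is not stated here, so this choice, as well as the function name, is an assumption for illustration.

```python
def compensated_sum(values):
    """Sum floats while tracking the rounding error of each addition
    (Neumaier's variant of Kahan summation)."""
    total = 0.0
    compensation = 0.0  # accumulated low-order bits lost so far
    for x in values:
        t = total + x
        if abs(total) >= abs(x):
            # Low-order digits of x were lost in the addition.
            compensation += (total - t) + x
        else:
            # Low-order digits of total were lost in the addition.
            compensation += (x - t) + total
        total = t
    return total + compensation
```

On the input `[1.0, 1e100, 1.0, -1e100]`, plain left-to-right summation loses both small terms, whereas the compensated sum recovers them. The price is several additional floating-point operations per element.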


Models We consider the PRISM benchmark suite² [22], comprising several prob-
abilistic models, in particular DTMC, CTMC, and MDP. Since there are not too 
many Markov chains in this set, we obtain further models as follows. For each 
CTMC, we consider the uniformized CTMC (which preserves the steady state 
distribution), and for MDP we choose actions uniformly at random. Unfortu- 
nately, all models obtained this way either comprise only single-state BSCCs or 
the whole model is a single BSCC. In the former case, our approximation within 
the BSCC is not used at all, in the latter, a sampling based approach needs to 
invest additional time to discover the whole system. In order to better compare 
the performance of our mean payoff based approximation approach, in these cases 


¹ We observed that the default hybrid engine typically is significantly slower than the
“explicit” variant and thus use that one, see [26, App. D].
² Obtained from https://github.com/prismmodelchecker/prism-benchmarks.

Fig. 2. A small MC where PRISM reports wrong results for $\varepsilon < 10^{-7}$.


we pre-explore the whole system and compute the stationary distribution directly 
through Algorithm 1. To compare the combined performance, we additionally 
consider a handcrafted model, named branch, which comprises both transient 
states as well as several non-trivial BSCCs. 
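
The uniformization step used to turn a CTMC into a Markov chain can be sketched as follows. This is a generic textbook construction written for illustration, not code from our tool; the function name, the dictionary representation of rates, and the default choice of the uniformization rate are assumptions.

```python
def uniformize(rates, uniformization_rate=None):
    """Turn CTMC rates into the transition probabilities of a DTMC.

    rates[s][t] is the rate from s to t (no self-loops expected). The
    uniformization rate must be at least the largest exit rate; by default
    the largest exit rate itself is used.
    """
    exit_rates = {s: sum(out.values()) for s, out in rates.items()}
    lam = uniformization_rate
    if lam is None:
        lam = max(exit_rates.values())
    chain = {}
    for s, out in rates.items():
        row = {t: r / lam for t, r in out.items()}
        # Remaining probability mass becomes a self-loop.
        stay = 1.0 - exit_rates[s] / lam
        if stay > 0.0:
            row[s] = row.get(s, 0.0) + stay
        chain[s] = row
    return chain
```

With generator matrix $Q$ and uniformization rate $\lambda$, the resulting matrix is $I + Q/\lambda$, so any $\pi$ with $\pi Q = 0$ also satisfies $\pi (I + Q/\lambda) = \pi$. Choosing $\lambda$ strictly larger than every exit rate additionally gives each state a self-loop.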

We present selected results, highlighting different strengths and weaknesses of 
each approach. An evaluation of the complete suite can be found in [26, App. D]. 


Correctness We discovered that PRISM potentially yields wrong results, due to 
an unsafe stopping criterion. In particular, PRISM iterates the power method 
until the absolute difference between subsequent iterates is small, exactly as 
with its “unsafe” value iteration for reachability, as reported by e.g. [7]. On 
the model from Fig. 2, PRISM (with explicit engine) immediately terminates, 
printing a result of = (4, z, 5, 3). However, the correct stationary distribution is 
~ (4,2,4,2) (from left to right), which both of our methods correctly identify. 
This behaviour is due to the small difference between first and second eigenvalue 
of the transition matrix, which in turn implies that the iterates of the power 
method only change by a small amount. We note that on this example, PRISM’s 
default hybrid engine eventually yields the correct result (after $\approx 10^{8}$ iterations)
due to the used iteration scheme. On a small variation of the model (included in
the artefact) it also terminates immediately with the wrong result. 
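
The effect of such a stopping criterion can be reproduced with a few lines of Python. The two-state chain below is a toy example of our own choosing (it is not the model of Fig. 2), and `power_method_unsafe` is a hypothetical name; the sketch only mimics the criterion of stopping once successive iterates differ by less than the precision.

```python
def power_method_unsafe(matrix, start, eps):
    """Power iteration that stops as soon as two successive iterates
    differ by less than eps in every component."""
    current = list(start)
    n = len(current)
    iterations = 0
    while True:
        nxt = [sum(current[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        iterations += 1
        if max(abs(a - b) for a, b in zip(current, nxt)) < eps:
            return nxt, iterations
        current = nxt


# Symmetric chain that switches state with probability 1e-9 per step.
# Its stationary distribution is (0.5, 0.5); the second eigenvalue is 1 - 2e-9.
a = 1e-9
slow_chain = [[1.0 - a, a], [a, 1.0 - a]]
result, steps = power_method_unsafe(slow_chain, [1.0, 0.0], 1e-7)
```

Here the iteration stops after a single step because the iterate changes by only about $10^{-9}$, and it reports a distribution close to $(1, 0)$ instead of $(0.5, 0.5)$.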


Results We summarize our results in Table 1. We observe several points. First, 
we see that the naive sampling approach can hardly handle non-trivial models. 
Second, our guided sampling approach achieves significant improvements on 
several models over both the classical, correct method as well as the potentially 
unsound approach of PRISM, in particular when hardly reachable portions of the 
state space can be completely discarded. However, on other models, the classical 
approach seems to be more appropriate, in particular on models with many likely 
to be reached BSCCs. Here, the sampling approach struggles to propagate the 
reachability bounds of all BSCCs simultaneously. Finally, as suggested by the 
phil and rabin models, using mean payoff based approximation can significantly 
outperform classical equation solving. In summary, PRISM, Classic, and Sample 
all can be the fastest method, depending on the structure of the model. However, 
recall that PRISM’s method does not give guarantees on the result. 


Further Discussion As expected, we observed that the runtime of approximation 
can increase drastically for smaller precision requirements (e.g. $\varepsilon = 10^{-8}$) and
solving the equation system precisely may actually be faster for some BSCCs. 
However, especially in the combined approach, if we already have some upper 
bounds on the reachability probability of a certain BSCC, we do not need to solve 
it with the original precision. Hence, a future version of the implementation could 
Table 1. Overview of our results. For each model, we list its parameters, overall size,
and number of BSCCs, followed by the total execution time in seconds for each tool.
TO denotes a timeout (300 seconds), MO a memout, and err an internal error. On
systems comprising a single BSCC, the Naive and Sample approach coincide.


Model        Parameters                             |S|        |BSCC|  PRISM  Classic  Naive  Sample
brp          N=64, MAX=5                            5,192      134     1.2    11       TO     4.9
nand         N=15, K=2                              56,128     16      4.9    30       TO     64
zeroconf_dl  reset=false, deadline=40, N=1000, K=1  251,740    10,048  99     238      8.0    1.0
phil4                                               9,440      1       err    TO       51     51
rabin3                                              27,766     1       err    MO       178    178
branch                                              1,087,079  1,000   155    TO       TO     20


dynamically decide whether to solve a BSCC based on mean payoff approximation 
or equation solving, combining advantages of both worlds. 

Secondly, this also highlights an interesting trade-off implicit to our approach: 
The algorithm needs to balance between exploring unknown areas and refining 
bounds on known BSCCs, in particular, since exploring a new BSCC adds 
noticeable effort: One more target for which the reachability has to be determined. 
Here, more sophisticated heuristics could be useful. 

Finally, for models with large BSCCs, such as rabin, we also observed that 
the classical linear equation approach indeed runs out of memory while a variant 
of the approximation algorithm can still solve it, as indicated by Remark 1. 
Thus, the implementation could moreover take memory constraints into account, 
deciding to apply the memory-saving approach in appropriate cases. 


6 Conclusion 


We presented a new perspective on computing the stationary distribution in 
Markov chains by rephrasing the problem in terms of mean payoff and reachability. 
We combined several recent advances for these problems to obtain a sophisti- 
cated partial-exploration based algorithm. Our evaluation shows that on several 
models our new approach is significantly more performant. As a major technical 
contribution, we provided a general algorithmic framework, which encompasses 
both the classical solution approach as well as our new method. 

As hinted by the discussion above, our framework is quite flexible. For future 
work, we particularly want to identify better guidance heuristics. Specifically, 
based on experimental data, we conjecture that the reachability part can be 
improved significantly. Moreover, due to the flexibility of our framework, we can 
apply different methods for each BSCC to obtain the reachability and stationary 
distribution. Thus, we want to find meta-heuristics which suggest the most 
appropriate method in each case. For example, for smaller BSCCs, we could 
use the classical, precise solution method to obtain the stationary distribution, 
while for larger ones we employ our mean payoff approach, and, in the spirit of 
Remark 1, for even larger ones we approximate them to the required precision 
immediately, saving memory. Additionally, we could identify BSCCs that satisfy 
the conditions of specialized approaches such as [11]. 
Abstract. Multiple-environment MDPs (MEMDPs) capture finite sets
of MDPs that share the states but differ in the transition dynamics. These 
models form a proper subclass of partially observable MDPs (POMDPs). 
We consider the synthesis of policies that robustly satisfy an almost-sure 
reachability property in MEMDPs, that is, one policy that satisfies a 
property for all environments. For POMDPs, deciding the existence of 
robust policies is an EXPTIME-complete problem. We show that this 
problem is PSPACE-complete for MEMDPs, while the policies require 
exponential memory in general. We exploit the theoretical results to 
develop and implement an algorithm that shows promising results in 
synthesizing robust policies for various benchmarks. 


1 Introduction 


Markov decision processes (MDPs) are the standard formalism to model sequential 
decision making under uncertainty. A typical goal is to find a policy that satisfies a 
temporal logic specification [5]. Probabilistic model checkers such as STORM [22] 
and PRISM [30] efficiently compute such policies. A concern, however, is the 
robustness against potential perturbations in the environment. MDPs cannot 
capture such uncertainty about the shape of the environment. 
Multi-environment MDPs (MEMDPs) [36,14] contain a set of MDPs, called 
environments, over the same state space. The goal in MEMDPs is to find a 
single policy that satisfies a given specification in all environments. MEMDPs 
are, for instance, a natural model for MDPs with unknown system dynamics, 
where several domain experts provide their interpretation of the dynamics [11]. 
These different MDPs together form a MEMDP. MEMDPs also arise in other 
domains: The guessing of a (static) password is a natural example in security. In 
robotics, a MEMDP captures unknown positions of some static obstacle. One 
can interpret MEMDPs as a (disjoint) union of MDPs in which an agent only has 
partial observation, i.e., every MEMDP can be cast into a linearly larger partially 
observable MDP (POMDP) [27]. Indeed, some famous examples for POMDPs are 
in fact MEMDPs, such as RockSample [39] and Hallway [31]. Solving POMDPs is 
notoriously hard [32], and thus, it is worthwhile to investigate natural subclasses. 
We consider almost-sure specifications where the probability needs to be 
one to reach a set of target states. In MDPs, it suffices to consider memoryless 
policies. Constructing such policies can be efficiently implemented by means of a
graph-search [5]. For MEMDPs, we consider the following problem: 


Compute one policy that almost-surely reaches the target in all environments. 


Such a policy robustly satisfies an almost-sure specification for a set of MDPs. 


Our approach. Inspired by work on POMDPs, we construct a belief-observation 
MDP (BOMDP) [16] that tracks the states of the MDPs and the (support of 
the) belief over potential environments. We show that a policy satisfying the 
almost-sure property in the BOMDP also satisfies the property in the MEMDP. 

Although the BOMDP is exponentially larger than the MEMDP, we exploit 
its particular structure to create a PSPACE algorithm to decide whether such a 
robust policy exists. The essence of the algorithm is a recursive construction of a 
fragment of the BOMDP, restricted to a setting in which the belief-support is fixed. 
Such an approach is possible, as the belief in a MEMDP behaves monotonically: 
Once we know that we are not in a particular environment, we never lose this 
knowledge. This behavior is in contrast to POMDPs, where there is no monotonic 
behavior in belief-supports. The difference is essential: Deciding almost-sure 
reachability in POMDPs is EXPTIME-complete [37,19]. In contrast, the problem 
of deciding whether a policy for almost-sure reachability in a MEMDP exists 
is indeed PSPACE-complete. We show the hardness using a reduction from the 
true quantified Boolean formula problem. Finally, we cannot hope to extract a 
policy with such an algorithm, as the smallest policy for MEMDPs may require 
exponential memory in the number of environments. 

The PSPACE algorithm itself recomputes many results. For practical purposes, 
we create an algorithm that iteratively explores parts of the BOMDP. The 
algorithm additionally uses the MEMDP structure to generalize the set of states 
from which a winning policy exists and deduce efficient heuristics for guiding 
the exploration. The combination of these ingredients leads to an efficient and 
competitive prototype on top of the model checker STORM. 


Related work. We categorize related work in three areas. 


MEMDPs. Almost-sure reachability for MEMDPs for exactly two environments 
has been studied by [36]. We extend the results to arbitrarily many environments. 
This is nontrivial: For two environments, the decision problem has a polynomial 
time routine [36], whereas we show that the problem is PSPACE-complete for 
an arbitrary number of environments. MEMDPs and closely related models 
such as hidden-model MDPs, hidden-parameter MDPs, multi-model MDPs, and 
concurrent MDPs [11,2,40,10] have been considered for quantitative properties¹.
The typical approach is to consider approximative algorithms for the undecidable 
problem in POMDPs [14] or adapt reinforcement learning algorithms [3,28]. These 
approximations are not applicable to almost-sure properties. 


POMDPs. One can build an underlying potentially infinite belief-MDP [27] that
corresponds to the POMDP; using model checkers [35,7,8] to verify this MDP
can answer the question for MEMDPs. For POMDPs, almost-sure reachability
is decidable in exponential time [37,19] via a construction similar to ours. Most
qualitative properties beyond almost-sure reachability are undecidable [4,15]. Two
dedicated algorithms that limit the search to policies with small memory requirements
and employ a SAT-based approach [12,26] to this NP-hard problem [19]
are implemented in STORM. We use them as baselines.


¹ Hidden-parameter MDPs are different from MEMDPs in that they assume a prior
over MDPs. However, for almost-sure properties, this difference is irrelevant.


Robust models. The high-level representation of MEMDPs is structurally similar 
to featured MDPs [18,1] that represent sets of MDPs. The proposed techniques 
are called family-based model checking and compute policies for every MDP in the 
family, whereas we aim to find one policy for all MDPs. Interval MDPs [25,43,23] 
and SGs [38] do not allow for dependencies between states and thus cannot model 
features such as various obstacle positions. Parametric MDPs [2,44,24] assume 
controllable uncertainty and do not consider robustness of policies. 


Contributions. We establish PSPACE-completeness for deciding almost-sure 
reachability in MEMDPs and show that the policies may be exponentially large. 
Our iterative algorithm, which is the first specific to almost-sure reachability in 
MEMDPs, builds fragments of the BOMDP. An empirical evaluation shows that 
the iterative algorithm outperforms approaches dedicated to POMDPs. 


2 Problem Statement 


In this section, we provide some background and formalize the problem statement. 

For a set X, Dist(X) denotes the set of probability distributions over X. 
For a given distribution $d \in \mathrm{Dist}(X)$, we denote its support as Supp(d). For a
finite set X, let unif(X) denote the uniform distribution. dirac(x) denotes the
Dirac distribution on $x \in X$. We use short-hand notation for functions and
distributions, $f = [x \mapsto a, y \mapsto b]$ means that $f(x) = a$ and $f(y) = b$. We write
$\mathcal{P}(X)$ for the powerset of X. For $n \in \mathbb{N}$ we write $[n] = \{i \in \mathbb{N} \mid 1 \le i \le n\}$.


Definition 1 (MDP). A Markov Decision Process is a tuple $M = (S, A, \iota_{\mathrm{init}}, p)$
where $S$ is the finite set of states, $A$ is the finite set of actions, $\iota_{\mathrm{init}} \in \mathrm{Dist}(S)$ is
the initial state distribution, and $p\colon S \times A \to \mathrm{Dist}(S)$ is the transition function.


The transition function is total, that is, for notational convenience MDPs are 
input-enabled. This requirement does not affect the generality of our results. A 
path of an MDP is a sequence $\pi = s_0 a_0 s_1 a_1 \ldots s_n$ such that $\iota_{\mathrm{init}}(s_0) > 0$ and
$p(s_i, a_i)(s_{i+1}) > 0$ for all $0 \le i < n$. The last state of $\pi$ is $\mathrm{last}(\pi) = s_n$. The set of
all finite paths is PATH and PATH($S'$) denotes the paths starting in a state from
$S' \subseteq S$. The set of reachable states from $S'$ is Reachable($S'$). If $S' = \mathrm{Supp}(\iota_{\mathrm{init}})$
we just call them the reachable states. The MDP restricted to reachable states
from a distribution $d \in \mathrm{Dist}(S)$ is ReachFragment($M$, $d$), where $d$ is the new
initial distribution. A state $s \in S$ is absorbing if Reachable($\{s\}$) $= \{s\}$. An MDP
is acyclic, if each state is absorbing or not reachable from its successor states. 
Action choices are resolved by a policy $\sigma\colon \mathrm{PATH} \to \mathrm{Dist}(A)$ that maps
paths to distributions over actions. A policy of the form $\sigma\colon S \to \mathrm{Dist}(A)$ is
called memoryless, deterministic if we have $\sigma\colon \mathrm{PATH} \to A$; and, memoryless
deterministic for $\sigma\colon S \to A$. For an MDP $M$, we denote the probability of a
policy $\sigma$ reaching some target set $T \subseteq S$ starting in state $s$ as $\Pr_M(s \to T \mid \sigma)$.
More precisely, $\Pr_M(s \to T \mid \sigma)$ denotes the probability of all paths from $s$
reaching $T$ under $\sigma$. We use $\Pr_M(T \mid \sigma)$ if $s$ is distributed according to $\iota_{\mathrm{init}}$.


Fig. 1: Example MEMDP

Definition 2 (MEMDP). A Multiple Environment MDP is a tuple $N =
(S, A, \iota_{\mathrm{init}}, \{p_i\}_{i \in I})$ with $S$, $A$, $\iota_{\mathrm{init}}$ as for MDPs, and $\{p_i\}_{i \in I}$ is a set of transition
functions, where $I$ is a finite set of environment indices.

Intuitively, MEMDPs form sets of MDPs (environments) that share states and 
actions, but differ in the transition probabilities. For a MEMDP $N$ with index set $I$
and a set $I' \subseteq I$, we define the restriction of environments as the MEMDP $N_{\downarrow I'} =
(S, A, \iota_{\mathrm{init}}, \{p_i\}_{i \in I'})$. Given an environment $i \in I$, we denote its corresponding
MDP as $N_i = (S, A, \iota_{\mathrm{init}}, p_i)$. A MEMDP with only one environment is an MDP.
Paths and policies are defined on the states and actions of MEMDPs and do not 
differ from MDP policies. A MEMDP is acyclic, if each MDP is acyclic. 


Example 1. Figure 1 shows a MEMDP with three environments $N_i$. An agent
can ask two questions, $q_1$ and $q_2$. The response is either ‘switch’ ($s_1 \leftrightarrow s_2$), or
‘stay’ (loop). In $N_1$, the response to $q_1$ and $q_2$ is to switch. In $N_2$, the response
to $q_1$ is stay, and to $q_2$ is switch. The agent can guess the environment using
$a_1, a_2, a_3$. Guessing $a_i$ leads to the target only in environment $i$. Thus, an
agent must deduce the environment via $q_1, q_2$ to surely reach the target.


Definition 3 (Almost-Sure Reachability). An almost-sure reachability property
is defined by a set $T \subseteq S$ of target states. A policy $\sigma$ satisfies the property $T$
for MEMDP $N = (S, A, \iota_{\mathrm{init}}, \{p_i\}_{i \in I})$ iff $\forall i \in I\colon \Pr_{N_i}(T \mid \sigma) = 1$.


In other words, a policy $\sigma$ satisfies an almost-sure reachability property $T$, and is then
called winning, if and only if the probability of reaching $T$ within each MDP is one. By
extension, a state $s \in S$ is winning if there exists a winning policy when starting
in state $s$. Policies and states that are not winning are losing.
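
For intuition, the following Python sketch checks whether one given memoryless deterministic policy is winning in every environment. It is a simplified illustration under assumptions: winning policies for MEMDPs may need memory in general, we use a single start state instead of an initial distribution, and the representation (one dictionary per environment, mapping state-action pairs to successor distributions) as well as the function names are hypothetical. The check uses the standard fact that a finite Markov chain reaches $T$ with probability one iff every state reachable before hitting $T$ can still reach $T$.

```python
def reaches_almost_surely(transitions, policy, start, targets):
    """Almost-sure reachability in the Markov chain induced by a
    memoryless deterministic policy on a single environment."""
    # Forward search: states reachable from start before hitting a target.
    seen, stack, successors = {start}, [start], {}
    while stack:
        s = stack.pop()
        if s in targets:
            continue
        successors[s] = {t for t, p in transitions[(s, policy[s])].items() if p > 0}
        for t in successors[s]:
            if t not in seen:
                seen.add(t)
                stack.append(t)
    # Backward fixed point: states that can reach a target at all.
    can_reach = set(targets) & seen
    changed = True
    while changed:
        changed = False
        for s, succ in successors.items():
            if s not in can_reach and succ & can_reach:
                can_reach.add(s)
                changed = True
    return seen <= can_reach


def is_winning(environments, policy, start, targets):
    """A policy is winning iff it reaches the targets almost surely
    in every environment."""
    return all(
        reaches_almost_surely(env, policy, start, targets) for env in environments
    )
```

Only the graph structure of each induced chain matters here; the precise probabilities are irrelevant as long as they are positive.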

We will now define both the decision and policy problem: 


Given a MEMDP $N$ and an almost-sure reachability property $T$.
The Decision Problem asks to decide if a policy exists that satisfies T. 
The Policy Problem asks to compute such a policy, if it exists. 


In Section 4 we discuss the computational complexity of the decision problem. 
Following up, in Section 5 we present our algorithm for solving the policy problem. 
Details on its implementation and evaluation will be presented in Section 6. 
3 A Reduction To Belief-Observation MDPs


In this section, we reduce the policy problem, and thus also the decision problem, 
to finding a policy in an exponentially larger belief-observation MDP. This 
reduction is an elementary building block for the construction of our PSPACE 
algorithm and the practical implementation. Additional information such as proofs 
for statements throughout the paper are available in the technical report [41]. 


3.1 Interpretation of MEMDPs as Partially Observable MDPs 


Definition 4 (POMDP). A partially observable MDP (POMDP) is a tuple
$(M, Z, O)$ with an MDP $M = (S, A, \iota_{\mathrm{init}}, p)$, a set $Z$ of observations, and an
observation function $O\colon S \to Z$.


A POMDP is an MDP where states are labelled with observations. We lift $O$ to
paths and use $O(\pi) = O(s_1) a_1 O(s_2) \ldots O(s_n)$. We use observation-based policies
$\sigma$, i.e., policies such that for $\pi, \pi' \in \mathrm{PATH}$, $O(\pi) = O(\pi')$ implies $\sigma(\pi) = \sigma(\pi')$. A
MEMDP can be cast into a POMDP that is made up as the disjoint union:


Definition 5 (Union-POMDP). Given a MEMDP $N = (S, A, \iota_{\mathrm{init}}, \{p_i\}_{i \in I})$,
we define its union-POMDP $N_{\sqcup} = ((S', A, \iota'_{\mathrm{init}}, p'), Z, O)$, with states $S' = S \times I$,
initial distribution $\iota'_{\mathrm{init}}((s, i)) = \iota_{\mathrm{init}}(s) \cdot |I|^{-1}$, transitions $p'((s, i), a)((s', i)) =
p_i(s, a)(s')$, observations $Z = S$, and observation function $O((s, i)) = s$.


A policy may observe the state s but not in which MDP we are. This forces any 
observation-based policy to take the same choice in all environments. 
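
The union-POMDP construction is mechanical, as the following Python sketch indicates. The representation is an assumption made for this example: environments are indexed $0, \ldots, |I|-1$, each environment is a dictionary from state-action pairs to successor distributions, and `union_pomdp` is a hypothetical name.

```python
def union_pomdp(states, initial, environments):
    """Build the union-POMDP of a MEMDP.

    initial      : dict mapping states to initial probabilities
    environments : list of dicts mapping (state, action) to
                   {successor: probability}, one dict per environment
    """
    n = len(environments)
    union_states = [(s, i) for s in states for i in range(n)]
    # Every environment is equally likely at the start.
    union_initial = {(s, i): p / n for s, p in initial.items() for i in range(n)}
    # Transitions never change the environment index.
    union_transitions = {
        ((s, i), a): {(t, i): p for t, p in dist.items()}
        for i, env in enumerate(environments)
        for (s, a), dist in env.items()
    }
    # Only the state component is observable.
    observation = {(s, i): s for (s, i) in union_states}
    return union_states, union_initial, union_transitions, observation
```

The resulting state space grows only linearly in the number of environments, while the environment index stays hidden from the policy.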


Lemma 1. Given MEMDP $N$, there exists a winning policy iff there exists an
observation-based policy $\sigma$ such that $\Pr_{N_{\sqcup}}(T \mid \sigma) = 1$.


The statement follows as, first, any observation-based policy of the POMDP can 
be applied to the MEMDP, second, vice versa, any MEMDP policy is observation- 
based, and third, the induced MCs under these policies are isomorphic. 


3.2 Belief-observation MDPs 


For POMDPs, memoryless policies are not sufficient, which makes computing 
policies intricate. We therefore add the information that the history — i.e., 
the path until some point — contains. In MEMDPs, this information is the 
(environment-)belief (support) $J \subseteq I$, as the set of environments that are consistent
with a path in the MEMDP. Given a belief $J \subseteq I$ and a state-action-state
transition $s \xrightarrow{a} s'$, we define $\mathrm{Up}(J, s, a, s') = \{i \in J \mid p_i(s, a, s') > 0\}$, i.e.,
the subset of environments in which the transition exists. For a path $\pi \in \mathrm{PATH}$,
we define its corresponding belief $B(\pi) \subseteq I$ recursively as:


$B(s_0) = I$ and $B(\pi \cdot s a s') = \mathrm{Up}(B(\pi \cdot s), s, a, s')$


The belief in a MEMDP monotonically decreases along a path, i.e., if we know 
that we are not in a particular environment, this remains true indefinitely. 
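
The belief update is easy to state in code. The sketch below assumes environments are indexed $0, \ldots, |I|-1$ and stored as dictionaries from state-action pairs to successor distributions, and that a path is given as an alternating list of states and actions; these conventions and the function names are ours, chosen for illustration.

```python
def update_belief(belief, environments, s, a, t):
    """Up(J, s, a, s'): environments in J where taking a in s can lead to t."""
    return frozenset(
        i for i in belief if environments[i].get((s, a), {}).get(t, 0.0) > 0.0
    )


def belief_of_path(environments, path):
    """B(pi) for a path given as [s0, a0, s1, a1, ..., sn]."""
    belief = frozenset(range(len(environments)))
    for k in range(0, len(path) - 2, 2):
        s, a, t = path[k], path[k + 1], path[k + 2]
        belief = update_belief(belief, environments, s, a, t)
    return belief
```

Since every update only filters the current belief, the returned sets can only shrink along a path.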
We aim to use a model where memoryless policies suffice. To that end, we
cast MEMDPs into the exponentially larger belief-observation MDPs [16]².


Definition 6 (BOMDP). For a MEMDP $N = (S, A, \iota_{\mathrm{init}}, \{p_i\}_{i \in I})$, we define
its belief-observation MDP (BOMDP) as a POMDP $G_N = ((S', A, \iota'_{\mathrm{init}}, p'), Z, O)$
with states $S' = S \times I \times \mathcal{P}(I)$, initial distribution $\iota'_{\mathrm{init}}((s, j, I)) = \iota_{\mathrm{init}}(s) \cdot |I|^{-1}$,
transition relation $p'((s, j, J), a)((s', j, J')) = p_j(s, a, s')$ with $J' = \mathrm{Up}(J, s, a, s')$,
observations $Z = S \times \mathcal{P}(I)$, and observation function $O((s, j, J)) = (s, J)$.


Compared to the union-POMDP, BOMDPs also track the belief by updating it 
accordingly. We clarify the correspondence between paths of the BOMDP and 
the MEMDP. For a path $\pi$ through the MEMDP, we can mimic this path exactly
in the MDPs $N_j$ for $j \in B(\pi)$. As we track $B(\pi)$ in the state, we can deduce from
the BOMDP state in which environments we can be. 


Lemma 2. For MEMDP N and the path (s1, j, J1)a1(80,9, J2) --. (Sn, J, Jn) of 
the BOMDP Gy, let j € Jı. Then: Jn 40 and the path sia ... Sn exists in MDP 
N; fic AN Jy. 


Consequently, the belief of a path can be uniquely determined by the observation 
of the last state reached, hence the name belief-observation MDPs. 


Lemma 3. For every pair of paths $\pi, \pi'$ in a BOMDP, we have:
$B(\pi) = B(\pi')$ implies $O(\mathrm{last}(\pi)) = O(\mathrm{last}(\pi'))$.


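For notation, we define $S_J = \{\langle s, j, J \rangle \mid j \in J, s \in S\}$, and analogously write
$Z_J = \{\langle s, J \rangle \mid s \in S\}$. We lift the target states $T$ to states in the BOMDP:
$T_{\mathcal{G}_\mathcal{N}} = \{\langle s, j, J \rangle \mid s \in T, J \subseteq I, j \in J\}$
and define target observations $T_Z = O(T_{\mathcal{G}_\mathcal{N}})$.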


Definition 7 (Winning in a BOMDP). Let $\mathcal{G}_\mathcal{N}$ be a BOMDP with target
observations $T_Z$. An observation-based policy $\sigma$ is winning from some observation
$z \in Z$, if for all $s \in O^{-1}(z)$ it holds that
$\Pr_{\mathcal{G}_\mathcal{N}}(s \to O^{-1}(T_Z) \mid \sigma) = 1$.


Furthermore, a policy $\sigma$ is winning if it is winning for the initial distribution $\iota_{init}$.
An observation $z$ is winning if there exists a winning policy for $z$. The winning
region $\mathrm{Win}^{T}_{\mathcal{G}_\mathcal{N}}$ is the set of all winning observations.

Almost-sure winning in the BOMDP corresponds to winning in the MEMDP. 


Theorem 1. There exists a winning policy for a MEMDP $\mathcal{N}$ with target states
$T$ iff there exists a winning policy in the BOMDP $\mathcal{G}_\mathcal{N}$ with target states $T_{\mathcal{G}_\mathcal{N}}$.


Intuitively, the important aspect is that for almost-sure reachability, observation-based
memoryless policies are sufficient [13]. For any such policy, the induced
Markov chains on the union-POMDP and the BOMDP are bisimilar [16].
BOMDPs make policy search conceptually easier. First, as memoryless policies
suffice for almost-sure reachability, winning regions are independent of fixed
policies: For policies $\sigma$ and $\sigma'$ that are winning in observation $z$ and $z'$, respectively,
there must exist a policy $\hat{\sigma}$ that is winning for both $z$ and $z'$. Second, winning
regions can be determined in polynomial time in the size of the BOMDP [16].


² This translation is notationally simpler than going via the union-POMDP.

3.3 Fragments of BOMDPs


To avoid storing the exponentially sized BOMDP, we only build fragments: We 
may select any set of observations as frontier observations and make the states 
with those observations absorbing. We later discuss the selection of frontiers. 


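Definition 8 (Sliced BOMDP). For a BOMDP
$\mathcal{G}_\mathcal{N} = \langle \langle S, A, \iota_{init}, p \rangle, Z, O \rangle$
and a set of frontier observations $F \subseteq Z$, we define a BOMDP
$\mathcal{G}_\mathcal{N}|F = \langle \langle S, A, \iota_{init}, p' \rangle, Z, O \rangle$ with:

$$\forall s \in S, a \in A: \quad p'(s, a) = \begin{cases} \mathrm{dirac}(s) & \text{if } O(s) \in F, \\ p(s, a) & \text{otherwise.} \end{cases}$$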
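
Slicing amounts to redirecting every action of a frontier state into a self-loop. The following is a minimal sketch under assumptions: the encoding `p[(s, a)]` as a successor distribution, the observation map `obs`, and the name `slice_model` are chosen for illustration only.

```python
from typing import Dict, Hashable, Set, Tuple

Dist = Dict[Hashable, float]

def slice_model(p: Dict[Tuple[Hashable, str], Dist],
                obs: Dict[Hashable, Hashable],
                frontier: Set[Hashable]) -> Dict[Tuple[Hashable, str], Dist]:
    """Return p' where states whose observation is in the frontier become absorbing."""
    sliced = {}
    for (s, a), dist in p.items():
        if obs[s] in frontier:
            sliced[(s, a)] = {s: 1.0}   # dirac(s)
        else:
            sliced[(s, a)] = dict(dist)  # unchanged copy of p(s, a)
    return sliced

# Tiny hypothetical model: u and v share observation "z1".
p = {("u", "a"): {"v": 1.0}, ("v", "a"): {"w": 0.5, "u": 0.5}, ("w", "a"): {"w": 1.0}}
obs = {"u": "z0", "v": "z1", "w": "z2"}
sliced = slice_model(p, obs, {"z1"})
```

Only the transition function changes; states, actions, and observations are kept, which matches the shape of Definition 8.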


We exploit this sliced BOMDP to derive constraints on the set of winning states. 


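Lemma 4. For every BOMDP $\mathcal{G}_\mathcal{N}$ with states $S$ and targets $T$ and for all
frontier observations $F \subseteq Z$ it holds that:
$\mathrm{Win}^{T}_{\mathcal{G}_\mathcal{N}|F} \subseteq \mathrm{Win}^{T}_{\mathcal{G}_\mathcal{N}} \subseteq \mathrm{Win}^{T \cup F}_{\mathcal{G}_\mathcal{N}|F}$.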


Making (non-target) observations absorbing extends the set of losing observations, 
while adding target states extends the set of winning observations. 


4 Computational Complexity 


The BOMDP $\mathcal{G}_\mathcal{N}$ above yields an exponential time and space algorithm via
Theorem 1. We can avoid the exponential memory requirement. This section 
shows the PSPACE-completeness of deciding whether a winning policy exists. 


Theorem 2. The almost-sure reachability decision problem is PSPACE-complete. 


The result follows from Lemmas 11 and 10 below. In Section 4.3, we show that 
representing the winning policy itself may however require exponential space. 


4.1 Deciding Almost-Sure Winning for MEMDPs in PSPACE 


We develop an algorithm with a polynomial memory footprint. The algorithm 
exploits locality of cyclic behavior in the BOMDP, as formalized by an acyclic 
environment graph and local BOMDPs that match the nodes in the environment 
graph. The algorithm recurses on the environment graph while memorizing results 
from polynomially many local BOMDPs. 


The graph-structure of BOMDPs. First, along a path of the MEMDP, we
will only gain information and are thus able to rule out certain environments [14].
Due to the monotonicity of the update operator, we have for any BOMDP
that $\langle s, j, J \rangle \in \mathrm{Reachable}(\langle s', j, J' \rangle)$ implies $J \subseteq J'$.
We define a graph over
environment sets that describes how the belief-support can update over a run.


Definition 9 (Environment graph). Let $\mathcal{N}$ be a MEMDP and $p$ the transition
function of $\mathcal{G}_\mathcal{N}$. The environment graph
$GE_\mathcal{N} = (V_\mathcal{N}, E_\mathcal{N})$ for $\mathcal{N}$ is a
directed graph with vertices $V_\mathcal{N} = \mathcal{P}(I)$ and edges


$$E_\mathcal{N} = \{ \langle J, J' \rangle \mid \exists s, s' \in S, a \in A, j \in I.\; p(\langle s, j, J \rangle, a, \langle s', j, J' \rangle) > 0 \text{ and } J \neq J' \}.$$


[Figure: directed graph over the belief-supports {1,2,3}, {1,2}, {1,3}, {2,3}, {1}, {2}, {3}.]

Fig. 2: The environment graph for our running example.


Example 2. Figure 2 shows the environment graph for the MEMDP in Ex. 1. It
consists of the different belief-supports. For example, the transition from $\{1, 2, 3\}$
to $\{2, 3\}$ and to $\{1\}$ is due to the action $q_1$ in state $s_0$, as shown in Fig. 1.
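
An environment graph can be explored on the fly from the full environment set. The sketch below is an assumption-laden illustration: it uses a hypothetical toy MEMDP (not the one from Fig. 1), the encoding `p[i][(s, a, s2)]`, and it only builds the vertices reachable from $I$ rather than all of $\mathcal{P}(I)$.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Transitions = Dict[int, Dict[Tuple[str, str, str], float]]

def environment_graph(p: Transitions, states: List[str], actions: List[str]):
    """Vertices reachable from I and edges (J, J') with J' = Up(J, s, a, s') != J."""
    all_envs = frozenset(p)
    seen: Set[FrozenSet[int]] = {all_envs}
    edges: Set[Tuple[FrozenSet[int], FrozenSet[int]]] = set()
    todo = [all_envs]
    while todo:
        J = todo.pop()
        for s in states:
            for a in actions:
                for s2 in states:
                    J2 = frozenset(i for i in J if p[i].get((s, a, s2), 0.0) > 0.0)
                    if J2 and J2 != J:  # no self-loops, no empty belief
                        edges.add((J, J2))
                        if J2 not in seen:
                            seen.add(J2)
                            todo.append(J2)
    return seen, edges

# Hypothetical toy MEMDP with environments 1, 2, 3.
TOY: Transitions = {
    1: {("s0", "q", "s1"): 1.0, ("s1", "r", "s3"): 1.0},
    2: {("s0", "q", "s1"): 0.5, ("s0", "q", "s2"): 0.5, ("s1", "r", "s1"): 1.0},
    3: {("s0", "q", "s2"): 1.0},
}
vertices, edges = environment_graph(TOY, ["s0", "s1", "s2", "s3"], ["q", "r"])
```

Every edge produced this way leads to a strict subset of its source, so the resulting graph is acyclic.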


Paths in the environment graph abstract paths in the BOMDP. Path fragments
where the belief-support remains unchanged are summarized into one step, as
we do not create edges of the form $\langle J, J \rangle$. We formalize this idea: Let
$\pi = \langle s_1, j, J_1 \rangle a_1 \langle s_2, j, J_2 \rangle \ldots \langle s_n, j, J_n \rangle$
be a path in the BOMDP. For any $J \subseteq I$, we
call $\pi$ a $J$-local path, if $J_i = J$ for all $i \in [n]$.


Lemma 5. For a MEMDP $\mathcal{N}$ with environment graph $GE_\mathcal{N}$, there is a path
$J_1 \ldots J_n$ iff there is a path $\pi = \pi_1 \ldots \pi_n$ in $\mathcal{G}_\mathcal{N}$ s.t. every $\pi_i$ is $J_i$-local.


The shape of the environment graph is crucial for the algorithm we develop. 


Lemma 6. Let $GE_\mathcal{N} = (V_\mathcal{N}, E_\mathcal{N})$ be an environment graph for MEMDP $\mathcal{N}$.
First, $E_\mathcal{N}(J, J')$ implies $J' \subsetneq J$. Thus, $GE_\mathcal{N}$ is acyclic and has maximal path length
$|I|$. The maximal outdegree of the graph is $|S|^2 |A|$.


The monotonicity regarding $J, J'$ follows from the definition of the belief update. The
bound on the outdegree is a consequence of Lemma 9 below.


Local belief-support BOMDPs. Before we continue, we remark that the 
(future) dynamics in a BOMDP only depend on the current state and set of 
environments. More formally, we capture this intuition as follows. 


Lemma 7. Let $\mathcal{G}_\mathcal{N}$ be a BOMDP with states $S'$. For any state $\langle s, j, J \rangle \in S'$, let
$\mathcal{N}' = \mathrm{ReachFragment}(\mathcal{N}_{\downarrow J}, \mathrm{dirac}(s))$ and
$Y = \{\langle s, i, J \rangle \mid i \in J\}$. Then:


$$\mathrm{ReachFragment}(\mathcal{G}_\mathcal{N}, \mathrm{unif}(Y)) = \mathcal{G}_{\mathcal{N}'}.$$


The key insight is that restricting the MEMDP does not change the transition
functions for the environments $j \in J$. Furthermore, using monotonicity of the
update, we only reach BOMDP-states whose behavior is determined by the
environments in $J$.

This intuition allows us to analyze the BOMDP locally and lift the results
to the complete BOMDP. We define a local BOMDP as the part of a BOMDP
starting in any state in $S_J$. All observations not in $Z_J$ are made absorbing.


Definition 10 (Local BOMDP). Given a MEMDP $\mathcal{N}$ with BOMDP $\mathcal{G}_\mathcal{N}$ and
a set of environments $J$, the local BOMDP for environments $J$ is the fragment


$$\mathrm{LocG}(J) = \mathrm{ReachFragment}(\mathcal{G}_{\mathcal{N}_{\downarrow J}} | F, \mathrm{unif}(S_J)) \quad\text{where}\quad F = Z \setminus Z_J.$$


Algorithm 1 Search algorithm

1: function SEARCH(MEMDP N = ⟨S, A, {p_i}_{i∈I}, ι_init⟩, J ⊆ I, T ⊆ S)
2:     T' ← {⟨s, j, J⟩ | j ∈ J, s ∈ T}
3:     for J' s.t. E_N(J, J') do        ▷ Consider the edges in the env. graph (Def. 9)
4:         W_J' ← SEARCH(N, J', T)      ▷ Recursion!
5:         T' ← T' ∪ {⟨s, j, J'⟩ | j ∈ J', ⟨s, J'⟩ ∈ W_J'}
6:     return Win^{T'}_{LocG(J)} ∩ Z_J  ▷ Construct BOMDP as in Def. 10, then model check
7:
8: function ASWINNING(MEMDP N = ⟨S, A, {p_i}_{i∈I}, ι_init⟩, T ⊆ S)
9:     return O(Supp(ι_init)) ⊆ SEARCH(N, I, T)


This definition of a local BOMDP coincides with a fragment of the complete
BOMDP. We then mark exactly the winning observations restricted to the
environment sets $J' \subsetneq J$ as winning in the local BOMDP and compute all
winning observations in the local BOMDP. These observations are winning in
the complete BOMDP. The following concretization of Lemma 4 formalizes this.


Lemma 8. Consider a MEMDP $\mathcal{N}$ and a subset of environments $J$.

$$\mathrm{Win}^{T'}_{\mathrm{LocG}(J)} \cap Z_J = \mathrm{Win}^{T_{\mathcal{G}_\mathcal{N}}}_{\mathcal{G}_\mathcal{N}} \cap Z_J \quad\text{with}\quad T' = T_{\mathcal{G}_\mathcal{N}} \cup \left(\mathrm{Win}^{T_{\mathcal{G}_\mathcal{N}}}_{\mathcal{G}_\mathcal{N}} \setminus Z_J\right).$$


Furthermore, local BOMDPs are polynomially bounded in the size of the MEMDP. 


Lemma 9. Let $\mathcal{N}$ be a MEMDP with states $S$ and actions $A$. $\mathrm{LocG}(J)$ has at
most $O(|S|^2 \cdot |A| \cdot |J|)$ states and $O(|S|^2 \cdot |A| \cdot |J|^2)$ transitions³.


A PSPACE algorithm. We present Algorithm 1 for the MEMDP decision
problem, which recurses depth-first over the paths in the environment graph⁴.
We first state the correctness and the space complexity of this algorithm. 


Lemma 10. ASWINNING in Alg. 1 solves the decision problem in PSPACE. 


To prove correctness, we first note that SEARCH($\mathcal{N}$, $J$, $T$) computes
$\mathrm{Win}^{T_{\mathcal{G}_\mathcal{N}}}_{\mathcal{G}_\mathcal{N}} \cap Z_J$.
We show this by induction over the structure of the environment graph. For all
$J$ without outgoing edges, the local BOMDP coincides with a BOMDP just for
environments $J$ (Lemma 7). Otherwise, observe that $T'$ in line 5 coincides with
its definition in Lemma 8 and thus, by the same lemma, we return
$\mathrm{Win}^{T_{\mathcal{G}_\mathcal{N}}}_{\mathcal{G}_\mathcal{N}} \cap Z_J$.
To finalize the proof, a winning policy exists in the MEMDP if the observations of
the initial states of the BOMDP are winning (Theorem 1). The algorithm must
terminate as it recurses over all paths of a finite acyclic graph, see Lemma 6.
Following Lemma 9, the number of frontier states is then bounded by $|S|^2 \cdot |A|$.
The main body of the algorithm therefore requires polynomial space, and the
maximal recursion depth (stack height) is $|I|$ (Lemma 6). Together, this yields a
space complexity in $O(|S|^2 \cdot |A| \cdot |I|^2)$.


³ The number of transitions is the number of nonzero entries in $p$.
⁴ In contrast to depth-first-search, we do not memorize nodes we visited earlier.

Fig. 3: Constructed MEMDP for the QBF formula $\forall x \exists y [(x \lor y) \land (\neg x \lor \neg y)]$.


4.2 Deciding Almost-Sure Winning for MEMDPs Is PSPACE-hard 
It is not possible to improve the algorithm beyond PSPACE. 

Lemma 11. The MEMDP decision problem is PSPACE-hard. 

Hardness holds even for acyclic MEMDPs and uses the following fact. 


Lemma 12. If a winning policy exists for an acyclic MEMDP, there also exists 
a winning policy that is deterministic. 


In particular, almost-sure reachability coincides with avoiding the sink states. 
This is a safety property. For safety, deterministic policies are sufficient, as 
randomization visits only additional states, which is not beneficial for safety. 
Regarding Lemma 11, we sketch a polynomial-time reduction from the
PSPACE-complete TQBF problem [20] to the MEMDP decision problem.
Let $\Psi$ be a QBF formula,
$\Psi = \exists x_1 \forall y_1 \exists x_2 \forall y_2 \ldots \exists x_n \forall y_n [\Phi]$ with $\Phi$ a Boolean
formula in conjunctive normal form. The problem is to decide whether $\Psi$ is true.


Example 3. Consider the QBF formula
$\Psi = \forall x \exists y [(x \lor y) \land (\neg x \lor \neg y)]$. We
construct a MEMDP with an environment for every clause, see Figure 3⁵. The
state space consists of three states for each variable $v \in V$: the state $v$ and
the states $v\top$ and $v\bot$ that encode their assignment. Additionally, we have a
dedicated target W and sink state F. We consider three actions: The actions true
($\top$) and false ($\bot$) semantically describe the assignment to existentially quantified
variables. The action any œg is used for all other states. Every environment 
reaches the target state iff one literal in the clause is assigned true. 

In the example, intuitively, a policy should assign the negation of $x$ to $y$.
Formally, the policy $\sigma$, characterized by $\sigma(\pi \cdot y) = \top$ iff $x\bot \in \pi$, is winning.
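
The truth of the example formula, and the fact that "assign the negation of x to y" witnesses it, can be checked by brute force. The evaluator below is a generic sketch for tiny prefixes only; the names `eval_qbf` and `phi` are illustrative and not part of the reduction.

```python
from typing import Callable, Tuple

def eval_qbf(prefix: Tuple[str, ...], matrix: Callable[..., bool],
             assignment: Tuple[bool, ...] = ()) -> bool:
    """Evaluate a closed QBF; prefix entries are 'forall' or 'exists', one per variable."""
    if not prefix:
        return matrix(*assignment)
    quantifier, rest = prefix[0], prefix[1:]
    branches = (eval_qbf(rest, matrix, assignment + (b,)) for b in (False, True))
    return all(branches) if quantifier == "forall" else any(branches)

def phi(x: bool, y: bool) -> bool:
    """Matrix of Example 3: (x or y) and (not x or not y)."""
    return (x or y) and ((not x) or (not y))

psi_is_true = eval_qbf(("forall", "exists"), phi)
negation_policy_wins = all(phi(x, not x) for x in (False, True))
swapped_prefix = eval_qbf(("exists", "forall"), phi)
```

Swapping the quantifier order makes the formula false, which shows that the choice for y has to depend on the observed value of x.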


As a consequence of this construction, we may also deduce the following theorem. 
Theorem 3. Deciding whether a memoryless winning policy exists is NP-complete. 


The proof of NP hardness uses a similar construction for the propositional SAT 
fragment of QBF, without universal quantifiers. Additionally, the problem for 
memoryless policies is in NP, because one can nondeterministically guess a (poly- 
nomially sized) memoryless policy and verify in each environment independently. 


⁵ We depict a slightly simplified MEMDP for conciseness.

Fig. 4: Witness for exponential memory requirement for winning policies.


4.3 Policy Problem 


Policies, mapping histories to actions, are generally infinite objects. However, we 
may extract winning policies from the BOMDP, which is (only) exponential in the 
MEMDP. Finite state controllers [34] are a suitable and widespread representation 
of policies that require only a finite amount of memory. Intuitively, the number 
of memory states reflects the number of equivalence classes of histories that a 
policy can distinguish. In general, we cannot hope to find smaller policies than 
those obtained via a BOMDP. 


Theorem 4. There is a family of MEMDPs $\{\mathcal{N}^n\}_{n \geq 1}$ where for each $n$, $\mathcal{N}^n$
has $2n$ environments and $O(n)$ states and where every winning policy for $\mathcal{N}^n$
requires at least $2^n$ memory states.


We illustrate the witness. Consider a family of MEMDPs $\{\mathcal{N}^n\}_{n \geq 1}$, where $\mathcal{N}^n$
has $2n$ MDPs, $4n$ states partitioned into two parts, and at most $2n$ outgoing
actions per state. We outline the MEMDP family in Figure 4. In the first part,
there is only one action per state. The notation is as follows: in state $s_0$ and
MDP $\mathcal{N}^n_1$, we transition with probability one to state $a_0$, whereas in $\mathcal{N}^n_2$ we
transition with probability one to state $b_0$. In every other MDP, we transition with
probability one half to either state. In state $s_1$, we do the analogous construction
for environments 3, 4, and all others. A path $s_0 b_0 \ldots$ is thus consistent with
every MDP except $\mathcal{N}^n_1$. The first part ends in state $s_n$. By construction, there
are $2^n$ paths ending in $s_n$. Each of them is (in)consistent with a unique set of $n$
environments. In the second part, a policy may guess $n$ times an environment by
selecting an action $\alpha_i$ for every $i \in [2n]$. Only in MDP $\mathcal{N}^n_i$, action $\alpha_i$ leads to a
target state. In all other MDPs, the transition leads from state $g_j$ to $g_{j+1}$. The
state $g_{n+1}$ is absorbing in all MDPs. Importantly, after taking an action $\alpha_i$ and
arriving in $g_{j+1}$, there is (at most) one more MDP inconsistent with the path.
Every MEMDP $\mathcal{N}^n$ in this family has a winning policy which takes
$\sigma(\pi \cdot g_i) = \alpha_{2i-1}$ if $a_i \in \pi$ and $\sigma(\pi \cdot g_i) = \alpha_{2i}$ otherwise.
Furthermore, when arriving in
state $s_n$, the state of a finite memory controller must reflect the precise set of
environments consistent with the history. There are $2^n$ such sets. The proof shows
that if we store less information, two paths will lead to the same memory state,
but with different sets of environments being consistent with these paths. As we
can rule out only $n$ environments using the $n$ actions in the second part of the
MEMDP, we cannot ensure winning in every environment.
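
The counting argument for the first part of the witness can be replayed mechanically. The sketch below assumes the numbering described above (in state $s_i$, counted from 0, environment $2i+1$ moves only to $a_i$ and environment $2i+2$ only to $b_i$); the function names are illustrative.

```python
from itertools import product
from typing import FrozenSet, Sequence, Set

def consistent_envs(choices: Sequence[str]) -> FrozenSet[int]:
    """Environments consistent with a first-part path; choices[i] is 'a' or 'b' at s_i."""
    n = len(choices)
    envs = set(range(1, 2 * n + 1))
    for i, choice in enumerate(choices):
        # Moving to a_i rules out environment 2i+2; moving to b_i rules out 2i+1.
        envs.discard(2 * i + 2 if choice == "a" else 2 * i + 1)
    return frozenset(envs)

def distinct_beliefs(n: int) -> Set[FrozenSet[int]]:
    """All environment sets that can be consistent with a path ending in s_n."""
    return {consistent_envs(c) for c in product("ab", repeat=n)}

beliefs = distinct_beliefs(3)
```

For every n, the 2^n paths yield 2^n pairwise different sets of n consistent environments, which is the quantity a controller has to remember when entering the second part.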


5 A Partial Game Exploration Algorithm 


In this section, we present an algorithm for the policy problem. We tune the 
algorithm towards runtime instead of memory complexity, but aim to avoid 
running out of memory. We use several key ingredients to create a pragmatic 
variation of Alg. 1, with support for extracting the winning policy. 

First, we use an abstraction from BOMDPs to a belief stochastic game
(BSG) similar to [45] that reduces the number of states and simplifies the
iterative construction⁶. Second, we tailor and generalize ideas from bounded model
checking [6] to build and model check only a fragment of the BSG, using explicit
partial exploration approaches as in, e.g., [33,9,42,29]. Third, our exploration
does not continuously extend the fragment, but can also prune this fragment by
using the model checking results obtained so far. The structure of the BSG as
captured by the environment graph makes the approach promising and yields
some natural heuristics. Fourth, the structure of the winning region allows us to
generalize results to unseen states. We thereby operationalize an idea from [26] in
a partial exploration context. Finally, we analyze individual MDPs as an efficient
and significant preprocessing step. In the following, we discuss these ingredients.


Abstraction to Belief Support Games. We briefly recap stochastic games 
(SGs). See [38,17] for more details. 


Definition 11 (SG). A stochastic game is a tuple $\mathcal{B} = \langle \mathcal{M}, S_1, S_2 \rangle$, where
$\mathcal{M} = \langle S, A, \iota_{init}, p \rangle$ is an MDP and $(S_1, S_2)$ is a partition of $S$.


$S_1$ are Player 1 states, and $S_2$ are Player 2 states. As common, we also 'partition'
(memoryless deterministic) policies into two functions $\sigma_1 : S_1 \to A$ and
$\sigma_2 : S_2 \to A$. A Player 1 policy $\sigma_1$ is winning for state $s$ if
$\Pr(T \mid \sigma_1, \sigma_2) = 1$ for all
$\sigma_2$. We (re)use $\mathrm{Win}^{T}_{\mathcal{B}}$ to denote the set of states with a winning policy.

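We apply a game-based abstraction to group states that have the same
observation. Player 1 states capture the observation in the BOMDP, i.e., tuples
$\langle s, J \rangle$ of MEMDP states $s$ and subsets $J$ of the environments. Player 1 selects
the action $a$, the result is Player 2 state $\langle \langle s, J \rangle, a \rangle$. Then Player 2 chooses an
environment $j \in J$, and the game mimics the outgoing transition from $\langle s, j, J \rangle$,
i.e., it mimics the transition from $s$ in $\mathcal{N}_j$. Formally:

Definition 12 (BSG). Let $\mathcal{G}_\mathcal{N}$ be a BOMDP with
$\mathcal{G}_\mathcal{N} = \langle \langle S, A, \iota_{init}, p \rangle, Z, O \rangle$.
A belief support game $\mathcal{B}_\mathcal{N}$ for $\mathcal{G}_\mathcal{N}$ is an SG
$\mathcal{B}_\mathcal{N} = \langle \langle S', A', \iota'_{init}, p' \rangle, S_1, S_2 \rangle$ with
$S' = S_1 \cup S_2$ as usual, Player 1 states $S_1 = Z$, Player 2 states $S_2 = Z \times A$,
actions $A' = A \cup I$, initial distribution
$\iota'_{init}(\langle s, I \rangle) = \sum_{i \in I} \iota_{init}(\langle s, i, I \rangle)$, and the
(partial) transition function $p'$ defined separately for Player 1 and 2:

$$p'(z, a) = \mathrm{dirac}(\langle z, a \rangle) \quad \text{(Player 1)}$$
$$p'(\langle z, a \rangle, j, z') = p(\langle s, j, J \rangle, a, \langle s', j, J' \rangle) \text{ with } z = \langle s, J \rangle,\; z' = \langle s', J' \rangle \quad \text{(Player 2)}$$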


⁶ At the time of writing, we were unaware of a polytime algorithm for BOMDPs.

Algorithm 2 Policy finding algorithm

function FINDPOLICY(MEMDP N = ⟨S, A, {p_i}_{i∈I}, ι_init⟩, targets T ⊆ S)
    W ← {⟨s, J⟩ | s ∈ T, J ⊆ I}; L ← ∅; i ← 1; S_init ← Supp(ι_init) × {I}
    while S_init ⊈ W and S_init ∩ L = ∅ do
        (B, F) ← GenerateGameSlice(N, W, L, i)
        W ← W ∪ Win^{W}_{B}              ▷ frontier states assumed losing
        L ← L ∪ (S \ Win^{W ∪ F}_{B})    ▷ frontier states assumed winning
        i ← i + 1
    if S_init ⊆ W then return ExtractPolicy(W) else return L


Lemma 13. An (acyclic) MEMDP $\mathcal{N}$ with target states $T$ is winning if(f) there
exists a winning policy in the BSG $\mathcal{B}_\mathcal{N}$ with target states $T_Z$.


Thus, on acyclic MEMDPs, a BSG-based algorithm is sound and complete;
however, on cyclic MEMDPs, it may not find the winning policy. The remainder of
the algorithm is formulated on the BSG: we use sliced BSGs as the BSG of a
sliced BOMDP, or equivalently, as a BSG with some states made absorbing.


Main algorithm. We outline Algorithm 2 for the policy problem. We track
the sets of (almost-sure) winning observations and losing observations (states in the BSG).
Initially, target states are winning. Furthermore, via a simple preprocessing, we
determine some winning and losing states on the individual MDPs.

We iterate until the initial state is winning or losing. Our algorithm constructs
a sliced BSG and decides on-the-fly whether a state should be a frontier state,
returning the sliced BSG and the used frontier states. We discuss the implementation
below. For the sliced BSG, we compute the winning region twice: Once
assuming that the frontier states are winning, once assuming they are losing.
This yields an approximation of the winning and losing states, see Lemma 4.
From the winning states, we can extract a randomized winning policy [13].
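
The text does not spell out the preprocessing on individual MDPs. One plausible ingredient, sketched here as an assumption rather than as the authors' procedure, is the standard graph-based fixpoint for almost-sure reachability in a single MDP: a state that loses in some environment on its own cannot be won robustly. The encoding `p[(s, a)]` and the function name are illustrative.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Dist = Dict[Hashable, float]

def almost_sure_winning(states: Iterable[Hashable],
                        p: Dict[Tuple[Hashable, str], Dist],
                        targets: Iterable[Hashable]) -> Set[Hashable]:
    """States of one MDP from which the targets are reachable with probability 1."""
    win = set(states)
    while True:
        # States that can reach a target using only actions that stay inside `win`.
        reach = set(targets) & win
        changed = True
        while changed:
            changed = False
            for (s, a), dist in p.items():
                if s in win and s not in reach:
                    succ = set(dist)
                    if succ <= win and succ & reach:
                        reach.add(s)
                        changed = True
        if reach == win:
            return win
        win = reach

# Hypothetical MDP: s2 can only gamble between the target and the sink.
p = {
    ("s0", "a"): {"s1": 0.5, "s0": 0.5}, ("s0", "b"): {"bad": 0.5, "t": 0.5},
    ("s1", "a"): {"t": 1.0}, ("s2", "b"): {"bad": 0.5, "t": 0.5},
    ("bad", "a"): {"bad": 1.0}, ("t", "a"): {"t": 1.0},
}
winning = almost_sure_winning(["s0", "s1", "s2", "bad", "t"], p, ["t"])
```

Only the supports of the distributions matter here, which is why the routine never inspects the probabilities themselves.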


Soundness. Assume that $\mathcal{B}$ is indeed a sliced BSG with frontier $F$. Then
the following invariant holds: $W \subseteq \mathrm{Win}^{T_Z}_{\mathcal{B}_\mathcal{N}}$ and
$L \cap \mathrm{Win}^{T_Z}_{\mathcal{B}_\mathcal{N}} = \emptyset$. This invariant
exploits that from a sliced BSG we can (implicitly) slice the complete BSG while
preserving the winning status of every state, formalized below. In future iterations
we only explore the implicitly sliced BSG.


Lemma 14. Given $W \subseteq \mathrm{Win}^{T_Z}_{\mathcal{B}_\mathcal{N}}$ and
$L \subseteq S \setminus \mathrm{Win}^{T_Z}_{\mathcal{B}_\mathcal{N}}$:
$\mathrm{Win}^{T_Z}_{\mathcal{B}_\mathcal{N}} = \mathrm{Win}^{T_Z \cup W}_{\mathcal{B}_\mathcal{N} | W \cup L}$.


Termination depends on the sliced game generation. It suffices to ensure that in
the long run, either $W$ or $L$ grow as there are only finitely many states. If $W$ and
$L$ remain the same longer than some number of iterations, $W \cup L$ will be used as
frontier. Then, the new game will suffice to determine if $s \in W$ in one shot.


Generating the sliced BSG. Algorithm 3 outlines the generation of the sliced
BSG. In particular, we explore the implicit BSG from the initial state but make
every state that we do not explicitly explore absorbing. In every iteration, we first
check if there are states in Q left to explore and if the number of explored states
in E is below a threshold Bound[i]. Then, we take a state from the priority queue
and add it to E. We find new reachable states⁷ and add them to the queue Q.


Algorithm 3 Game generation algorithm

1: function GENERATEGAMESLICE(MEMDP N, W, L, i)
2:     Q ← {s_ι}; E ← ∅
3:     while s ∈ Q exists and |E| < Bound[i] do
4:         E ← E ∪ {s}                        ▷ Mark s as explored
5:         B ← B_N | (S \ E)                  ▷ Extend game, cut off everything not explored
6:         Q ← Reachable(B) \ (E ∪ W ∪ L)     ▷ Add newly reached states
7:     return B, Q


Generalizing the winning and losing states. We aim to determine that a 
state in the game By is winning without ever exploring it. First, observe: 


Lemma 15. A winning policy in MEMDP $\mathcal{N}$ is winning in $\mathcal{N}_{\downarrow J}$ for any $J$.


A direct consequence is the following statement for two environment sets $J_1 \subseteq J_2$:
$\langle s, J_2 \rangle \in \mathrm{Win}_{\mathcal{B}_\mathcal{N}}$ implies
$\langle s, J_1 \rangle \in \mathrm{Win}_{\mathcal{B}_\mathcal{N}}$.


Consequently, we can store $W$ (and symmetrically, $L$) as follows. For every
MEMDP state $s \in S$, $W_s = \{J \mid \langle s, J \rangle \in W\}$ is downward closed on the partial
order $(\mathcal{P}(I), \subseteq)$. This allows for efficient storage: We only have to store the set
of pairwise maximal elements, i.e., the antichain,


$$W_s^{max} = \{J \in W_s \mid \nexists J' \in W_s \text{ with } J \subsetneq J'\}.$$


To determine whether $\langle s, J \rangle$ is winning, we check whether $J \subseteq J'$ for some
$J' \in W_s^{max}$. Adding $J$ to $W_s^{max}$ requires removing all $J' \subseteq J$ and then adding $J$.
Note, however, that $|W_s^{max}|$ is still exponential in $|I|$ in the worst case.
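
The antichain storage for one MEMDP state can be sketched as follows. The class and method names are hypothetical; the only deviation from the description above is that `add` first checks whether the new set is already covered, so that the stored sets stay pairwise incomparable.

```python
from typing import FrozenSet, Iterable, List

class Antichain:
    """Stores a downward-closed family of environment sets via its maximal elements."""

    def __init__(self) -> None:
        self.maximal: List[FrozenSet[int]] = []

    def contains(self, envs: Iterable[int]) -> bool:
        """True iff envs is a subset of some stored maximal set."""
        J = frozenset(envs)
        return any(J <= M for M in self.maximal)

    def add(self, envs: Iterable[int]) -> None:
        """Insert envs, dropping every stored set that it subsumes."""
        J = frozenset(envs)
        if self.contains(J):
            return  # already represented by a larger (or equal) set
        self.maximal = [M for M in self.maximal if not M <= J]
        self.maximal.append(J)

store = Antichain()
store.add({1})
store.add({1, 2})   # subsumes {1}
store.add({2, 3})   # incomparable with {1, 2}
store.add({1})      # already covered, no change
```

A full store for $W$ would keep one such antichain per MEMDP state, e.g. in a dictionary keyed by $s$; lookups then cost one subset test per maximal element.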


Selection of heuristics. The algorithm allows some degrees of freedom. We 
evaluate the following aspects empirically. (1) The maximal size bound[i] of a
sliced BSG at iteration i is critical. If it is too small, the sets W and L will grow 
slowly in every iteration. The trade-off is further complicated by the fact that 
the sets W and L may generalize to unseen states. (2) For a fixed bound[i], it 
is unclear how to prioritize the exploration of states. The PSPACE algorithm 
suggests that going deep is good, whereas the potential for generalization to 
unseen states is largest when going broad. (3) Finally, there is overhead in 
computing both W and L. If there is a winning policy, we only need to compute 
W. However, computing L may ensure that we can prune parts of the state space. 
A similar observation holds for computing W on unsatisfiable instances. 


Remark 1. Algorithm 2 can be mildly tweaked to meet the PSPACE algorithm
in Algorithm 1. The priority queue must ensure to always include complete
(reachable) local BSGs and to explore states $\langle s, J \rangle$ with small $J$ first. Furthermore,
W and L require regular pruning, and we cannot extract a policy if we prune W
to a polynomial size bound. Practically, we may write pruned parts of W to disk.


⁷ In l. 5 we do not rebuild the game B from scratch but incrementally construct the
data structures. Likewise, reachable states are a direct byproduct of this construction.

Fig. 5: Performance of baselines and novel PAGE algorithm


6 Experiments 


We highlight two aspects: (1) A comparison of our prototype to existing baselines 
for POMDPs, and (2) an examination of the exploration heuristics. The technical 
report [41] contains details on the implementation, the benchmarks, and more 
results. 


Implementation. We provide a novel PArtial Game Exploration (PAGE) prototype, 
based on Algorithm 2, on top of the probabilistic model checker STORM [22]. 
We represent MEMDPs using the PRISM language with integer constants. Every 
assignment to these constants induces an explicit MDP. SGs are constructed and 
solved using existing data structures and graph algorithms. 


Setup. We create a set of benchmarks inspired by the POMDP and MEMDP 
literature [26,12,21]. We consider a combination of satisfiable and unsatisfiable 
benchmarks. In the latter case, a winning policy does not exist. We construct 
POMDPs from MEMDPs as in Definition 5. As baselines, we use the following 
two existing POMDP algorithms. For almost-sure properties, a belief-MDP 
construction [7] acts similar to an efficiently engineered variant of our game- 
construction, but tailored towards more general quantitative properties. A SAT- 
based approach [26] aims to find increasingly larger policies. We evaluate all 
benchmarks on a system with a 3GHz Intel Core i9-10980XE processor. We use 
a time limit of 30 minutes and a memory limit of 32 GB. 


Results. Figure 5 shows the (log scale) performance comparisons between different
configurations⁸. Green circles reflect satisfiable and red crosses unsatisfiable
benchmarks. On the x-axis is PAGE in its default configuration. The first plot
compares to the belief-MDP construction. The tailored heuristics and representation
of the belief-support give a significant edge in almost all cases. The few points
below the line are due to a higher exploration rate when building the state space.
The second plot compares to the SAT-based approach, which is only suitable
for finding policies, not for disproving their existence. This approach implicitly
searches for a particular class of policies, whose structure is not appropriate for
some MEMDPs. The third plot compares PAGE in the default configuration
(with negative entropy as priority function) with PAGE using positive entropy.
As expected, different priorities have a significant impact on the performance.
Table 1 shows an overview of satisfiable and unsatisfiable benchmarks. Each
table shows the number of environments, states, and actions-per-state in the
MEMDP. For PAGE, we include both the default configuration (negative entropy)
and variation (positive entropy). For both configurations, we provide columns
with the time and the maximum size of the BSG constructed. We also include the
time for the two baselines. Unsurprisingly, the number of states to be explored is
a good predictor for the performance and the relative performance is as in Fig. 5.


⁸ Every point (x,y) in the graph reflects a benchmark which was solved by the
configuration on the x-axis in x time and by the configuration on the y-axis in y time.
Points above the diagonal are thus faster for the configuration on the x-axis.


Table 1: Satisfiable and unsatisfiable benchmark results

PAGE(posentr) PAGE(negentr) Belief SAT

81 21 81 | 41.1 170291 38.6 296407 MO


7 Conclusion 


This paper considers multi-environment MDPs with an arbitrary number of 
environments and an almost-sure reachability objective. We show novel and 
tight complexity bounds and use these insights to derive a new algorithm. This 
algorithm outperforms approaches for POMDPs on a broad set of benchmarks. 
For future work, we will apply an algorithm directly on the BOMDP [16].


Abstract. Mungojerrie is an extensible tool that provides a framework
to translate linear-time objectives into reward for reinforcement
learning (RL). The tool provides convergent RL algorithms for stochastic
games, reference implementations of existing reward translations for
ω-regular objectives, and an internal probabilistic model checker for
ω-regular objectives. This functionality is modular and operates on shared
data structures, which enables fast development of new translation techniques.
Mungojerrie supports finite models specified in PRISM and
ω-automata specified in the HOA format, with an integrated command
line interface to external linear temporal logic translators. Mungojerrie
is distributed with a set of benchmarks for ω-regular objectives in RL.


1 Introduction 


Reinforcement learning (RL) [41] is a sequential optimization approach where 
a decision maker learns to optimally resolve a sequence of choices based on 
feedback received from the environment. This feedback often takes the form of 
rewards and punishments proportional to the fitness of the decisions taken by 
the agent (or their effects) as judged by the environment towards some higher- 
level objectives. We call such objectives learning objectives. RL is inspired by the 
way dopamine-driven organisms latch on to past rewarding actions and hence, 
historically, RL adopted a myopic way of looking at the reward sequences in the 
form of the discounted-sum of rewards, where the discount factor controls the 
weight placed toward future rewards. More recently, other forms of reward aggre- 
gation, such as limit-average, have also been considered. A key design challenge 
for users of RL is that of translation: given a class of learning objectives and 
aggregator functions, design a reward function from the sequence of learner’s 
choices to scalar rewards such that an RL agent maximizing the aggregated sum 
of rewards converges to an optimal policy for the learning objective. 


* Mungojerrie is available at plv.colorado.edu/mungojerrie. This work is supported in
part by the National Science Foundation (NSF) grant CCF-2009022 and by NSF CA-
Union’s Horizon 2020 research and innovation programme under grant agreements
No 864075 (CAESAR) and 956123 (FOCETA).

Fig. 1. The reinforcement learning loop implemented within Mungojerrie. The interpreter
assigns reward to the agent based on the state of the model and automaton.


The translation of objectives to reward signals has historically been a largely 
manual process. Such translations not only depend on the expertise of the trans- 
lator in reward engineering, they also pose obstacles to providing formal guar- 
antees on the faithfulness of the translation. Unsurprisingly, specifying reward 
manually is prone to error [22,44]. As the practice of model-free RL continues 
to produce impressive results [38,31,29], the integration of RL in safety-critical 
system design is inevitable. An alternative to manually programming the reward 
function is to specify the objective in a formal language and have it “compiled” 
to a reward function. We call such a translation a reward scheme. 

In designing reward schemes for RL, one strives to achieve an overall trans- 
lation that is faithful (maximizing reward means maximizing the probability of 
achieving the objective) and effective (RL quickly converges to optimal strate- 
gies). While the faithfulness of a reward scheme can be established theoretically, 
its effectiveness requires experimental evaluation. Experimenting with reward 
schemes requires a framework for specifying learning objectives, environments, 
a wide range of RL algorithms, and an interface for connecting reward schemes 
with these components. In addition, it may be beneficial to have access to a 
probabilistic model checker to evaluate the quality of the policy computed by 
RL, and to compare it against ground truth. 


Mungojerrie is designed to provide this functionality for learning requirements
expressible as linear-time objectives (ω-regular languages [32] and
linear temporal logic [27,33]) against finite MDPs and stochastic games.


Features. Mungojerrie is designed with ease of use and extensibility in mind.
Models in Mungojerrie can be specified in PRISM [25], which maintains compatibility
with existing benchmarks, or by explicitly constructing the model via calls
to internal functions. Mungojerrie supports reading ω-automata in the Hanoi
Omega Automata (HOA) format [2], and has a command line interface connecting
Mungojerrie with performant LTL translators (Spot [7] and Owl [24]).
Mungojerrie provides an OpenAI Gym [4] like interface between the RL algorithms
(included with the tool) and the learning environment to allow integration
with off-the-shelf RL algorithms. The tool also has methods for performing
probabilistic model checking (including end-component decomposition, stochastic
shortest-path, and discounted-reward optimization) of ω-regular objectives
on the same data structures used for learning. Mungojerrie also provides reference
implementations of several reward schemes [11,12,14,19,23] proposed by the
formal methods community. Mungojerrie is packaged with over 100 benchmarks
and outputs GraphViz [8] for easy visualization of small models and automata.


An introductory example. Figure 2 shows an example MDP in which a gambler
places bets with the aim of accumulating a wealth of 7 units. In addition,
the gambler will quit if her wealth wanes to just one unit more than once. This
objective is captured by the (deterministic) Büchi automaton of Fig. 3.
Mungojerrie computes a strategy for the gambler that maximizes the probability
of satisfying her objective. Figure 4 shows the Markov chain that results from
following this strategy. The figure is a minimally modified version of the
GraphViz output of Mungojerrie. Note that the strategy altogether avoids the
state in which x = 1; hence it achieves the same probability of success (5/7)
as an optimal strategy for the simpler objective of eventually reaching x = 7
(without going broke). Mungojerrie computes the strategy of Fig. 4 by RL; it
can also verify it by probabilistic model checking.
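The 5/7 figure can be checked independently of the tool. The following is a minimal sketch, not Mungojerrie's own model checker: it runs plain value iteration on the gambler model of Fig. 2 (p = 1/2, initial wealth 5), once for ordinary reachability of x = 7 and once with x = 1 treated as a losing sink. The function name and the `forbidden` parameter are illustrative assumptions.

```python
P_WIN = 0.5        # probability of winning one bet (p in Fig. 2)
TARGET = 7         # wealth at which the gambler is "rich"
BETS = (1, 2, 3)   # stakes of actions b1, b2, b3

def max_reach_probability(forbidden=frozenset(), sweeps=5000):
    """Value iteration for the maximal probability of reaching TARGET.

    States in `forbidden` are treated as losing sinks (value 0).
    """
    value = [0.0] * (TARGET + 1)
    value[TARGET] = 1.0
    for _ in range(sweeps):
        for x in range(1, TARGET):
            if x in forbidden:
                continue
            # A bet of b units is enabled when 0 <= x - b and x + b <= TARGET,
            # which matches the guards of b1, b2, b3 in Fig. 2.
            value[x] = max(
                P_WIN * value[x + b] + (1 - P_WIN) * value[x - b]
                for b in BETS
                if b <= x <= TARGET - b
            )
    return value

plain = max_reach_probability()
avoid_poor = max_reach_probability(forbidden=frozenset({1}))
print(plain[5], avoid_poor[5])
```

Both printed values should agree with 5/7 up to numerical error, which is consistent with the observation that avoiding x = 1 costs the gambler nothing when the bets are fair.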


2 Overview of Mungojerrie 


Models. The systems used in Mungojerrie consist of finite sets of states and 
actions, where states are labeled with atomic propositions. There are at most 
two strategic players: Max player and Min player. Each state is controlled by 
one player. We call models where all states are controlled by Max player Markov 
decision processes (MDPs) [34]. Else, we refer to them as stochastic games [5]. 
Mungojerrie supports parsing models specified in the PRISM language. The 
allowed model types are “mdp” (Markov decision process) and “smg” (stochas- 
tic multiplayer game) with two players. There should be one initial state. The 
interface for building the model is exposed, allowing extensions of Mungojerrie 
to connect with parsers for other languages. The authors of [6] used Mungojerrie 
in their experiments by extending the tool to support continuous-time MDPs. 


Properties. The properties natively supported by Mungojerrie are ω-regular
languages. Starting from the initial state, the players produce an infinite
sequence of states with a corresponding infinite sequence of atomic propositions:
an ω-word. The inclusion of this ω-word in our ω-regular language determines
whether or not this particular run satisfies the property. The Max player
maximizes the probability that a run is satisfying, while the goal of the Min
player is the opposite.


We specify our ω-regular language as an ω-automaton, which may be
nondeterministic. For model checking and RL, this nondeterminism must be resolved
on the fly. Automata where this can be done in any MDP without changing
acceptance are said to be Good-for-MDPs (GFM) [13]. Automata where this
can be done in any stochastic game without changing acceptance are said to be
Good-for-Games (GFG) [21]. In general, nondeterministic Büchi automata are
not GFM, but two classes of GFM Büchi automata with limited nondeterminism
have been studied: suitable limit-deterministic Büchi automata [10,37] and slim
Büchi automata [13].

The user of Mungojerrie can either provide the ω-automaton directly or use
one of the supported external translators to generate the automaton from LTL
with a single call to Mungojerrie. Mungojerrie reads automata specified in the
HOA format. Mungojerrie supports providing the ω-automaton directly for
testing the effectiveness of different automata for learning (see Section 4). The
LTL translators that can be called from Mungojerrie are the EPMC plugin from
[13], Spot [7], and Owl [24] for generating slim Büchi, deterministic parity,
and suitable limit-deterministic Büchi automata. The user is responsible for
ensuring that ω-automata provided directly have the appropriate property
(GFM or GFG).

For use in Mungojerrie, the labels and acceptance conditions for the au- 
tomaton should be on the transitions. The acceptance conditions supported by 


 0  mdp
 1
 2  const int Wealth = 5;  // initial gambler's wealth
 3  const double p = 1/2;  // probability of winning one bet
 4
 5  label "rich" = x = 7;
 6  label "poor" = x = 1;
 7
 8  module gambler
 9    x : [0..7] init Wealth;
10
11    [b0] x=0 | x=7 -> true;  // absorbing states
12    [b1] x>0 & x<7 -> p : (x'=x+1) + (1-p) : (x'=x-1);
13    [b2] x>1 & x<6 -> p : (x'=x+2) + (1-p) : (x'=x-2);
14    [b3] x>2 & x<5 -> p : (x'=x+3) + (1-p) : (x'=x-3);
15  endmodule

Fig. 2. A Gambler’s Ruin model in the PRISM language. Line 13, for example, says
that when 1 < x < 6, the gambler may bet two units because action b2 is enabled.
The ‘+’ sign does double duty: as addition symbol in arithmetic expressions and as
separator of probabilistic transitions.

Fig. 3. Deterministic Büchi automaton equivalent to the LTL formula
¬poor U (rich ∨ (poor ∧ X(¬poor U rich))). The transitions marked with the green
dots are accepting.

Fig. 4. Optimal gambler strategy for the objective of Fig. 3. Boxes are decision states
and circles are probabilistic choice states. For a decision state, the label gives the value
of x and the state of the automaton. Transitions are labelled with either an action or
a probability, and with the priority (1 for accepting and 0 for non-accepting).


Mungojerrie should be reducible to parity acceptance conditions without
altering the transition structure of the automaton. This includes parity, Büchi,
co-Büchi, Streett 1 (one pair), and Rabin 1 (one pair) conditions.
Nondeterministic automata must have Büchi acceptance conditions. Generalized
acceptance conditions are not supported in version 1.1.
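To illustrate what "reducible to parity without altering the transition structure" means, the sketch below relabels Büchi and co-Büchi transitions with priorities under a max-odd parity condition, the convention used in the caption of Fig. 4 (1 for accepting, 0 for non-accepting). The function names are hypothetical and this is not the tool's internal code; only the priorities on the transitions change, not the transitions themselves.

```python
def buchi_to_parity(accepting):
    """Büchi: accepting transitions get priority 1, all others priority 0."""
    return 1 if accepting else 0

def cobuchi_to_parity(rejecting):
    """co-Büchi: rejecting transitions get priority 2, all others priority 1."""
    return 2 if rejecting else 1

def accepts_lasso(loop_priorities):
    """Max-odd parity on an ultimately periodic run.

    `loop_priorities` lists the priorities of the transitions on the cycle
    that the run repeats forever; the finite prefix is irrelevant.
    """
    return max(loop_priorities) % 2 == 1

# A cycle that sees an accepting Büchi transition is accepted.
print(accepts_lasso([buchi_to_parity(a) for a in (False, True)]))
# A cycle that keeps seeing a rejecting co-Büchi transition is not.
print(accepts_lasso([cobuchi_to_parity(r) for r in (True, False)]))
```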


Reinforcement Learning. The RL algorithms optimize over MDP/stochastic
game environments equipped with a Markovian reward function. The reward
function assigns a reward $R_{t+1} \in \mathbb{R}$ dependent on the state and
action at timestep $t$ and the next state at timestep $t+1$. As the players
make their choices within the environment, the resulting play produces a
sequence of states, actions, and rewards
$(S_0, A_0, R_1, S_1, A_1, R_2, \ldots)$. The discounted reward aggregator is

$$\mathrm{Disc}_{\gamma}(\pi, \nu) = \mathbb{E}_{\pi,\nu}\Big[\sum_{t \ge 0} \gamma^{t} R_{t+1}\Big],$$

where $\pi$ is the strategy for Max player, $\nu$ is the strategy for Min player,
$\gamma \in [0, 1)$ is the discount factor, and $R_t$ is the reward at timestep
$t$. We can set $\gamma = 1$ when with probability 1 we enter an absorbing sink
(termination), where we receive no reward. This is called the episodic setting.
Another well-studied RL aggregator is the limit-average reward defined as

$$\mathrm{Avg}(\pi, \nu) = \limsup_{n \to \infty} \frac{1}{n}\, \mathbb{E}_{\pi,\nu}\Big[\sum_{n > t \ge 0} R_{t+1}\Big].$$

The limit-average reward aggregator is natural in the continuing setting, where
the agent’s trajectory is never reset and there is no preferred initial state [30].
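As a small numerical illustration of the two aggregators, the sketch below evaluates them on a single finite reward sequence rather than in expectation. The function names are illustrative assumptions; `rewards[t]` plays the role of the reward received for the step taken at timestep t.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] over a finite prefix of a play."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total

def average_reward(rewards):
    """Average reward over the first len(rewards) steps of a play."""
    return sum(rewards) / len(rewards)

print(discounted_return([1, 1, 1], 0.5))   # 1 + 0.5 + 0.25
print(average_reward([0, 1, 0, 1]))
```

The discounted sum weights early rewards more heavily, whereas the average treats every step alike, which is why the latter suits the continuing setting described above.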
The objective of RL is to compute the optimal value and policies for a given 
aggregator. Mungojerrie includes the stochastic game extensions of Q-learning 
[43], Double Q-learning [20], and Sarsa(λ) [40] for RL in finite state and action
models. Mungojerrie also includes Differential Q-learning [42] for average RL 
in finite communicating MDPs. We collectively refer to parameters that are set 
by hand prior to running an RL algorithm as hyperparameters. Mungojerrie 
supports changing all hyperparameters from the command line. As the design of 
Mungojerrie separates the learning agent(s) from the reward scheme, extending 
Mungojerrie to include another RL algorithm is easy. 
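For readers unfamiliar with the tabular algorithms listed above, the following is a minimal sketch of a single Q-learning update. It is a textbook formulation, not Mungojerrie's implementation; the `maximize` flag is an assumption about how the update could be adapted to states controlled by the Min player in a stochastic game.

```python
from collections import defaultdict

def q_learning_update(q, state, action, reward, next_state, next_actions,
                      alpha, gamma, done=False, maximize=True):
    """Apply one tabular Q-learning update in place and return the new value.

    `q` maps (state, action) pairs to value estimates. `maximize=False`
    bootstraps with a minimum instead, as one would for a Min-player state
    (an assumption made for this sketch).
    """
    if done or not next_actions:
        bootstrap = 0.0
    else:
        choose = max if maximize else min
        bootstrap = choose(q[(next_state, a)] for a in next_actions)
    target = reward + gamma * bootstrap
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]

q = defaultdict(float)
q[("s1", "a")] = 1.0
new_value = q_learning_update(q, "s0", "a", 0.0, "s1", ["a"], alpha=0.1, gamma=0.9)
print(new_value)
```

The learning rate `alpha` and discount `gamma` are examples of the hyperparameters mentioned in the text.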


Reward Schemes. The user of Mungojerrie can either select one of the reward 
schemes included with the tool or extend the tool to include a new reward 
scheme. Mungojerrie also allows the use of the reward specified in the PRISM 
model (either state- or action-based). The following reward schemes are included 
in version 1.1 of Mungojerrie: 


— Limit-reachability. The limit-reachability scheme [11] uses a GFM Büchi
automaton. This reward scheme converts accepting edges in the automaton into a
transition to a sink with probability 1−ζ with a reward of +1, where 0 < ζ < 1 is
a hyperparameter. All other transitions produce zero reward. For a sufficiently
large ζ and discount factor γ, strategies that are optimal for the discounted
reward maximize the probability of satisfaction of the Büchi objective.

— Multi-discounted. The multi-discounted reward scheme [3] also uses a GFM
Büchi automaton. This translation converts accepting edges in the automaton
into a transition that gives 1−γ_B reward with a discount of γ_B, where 0 < γ_B < 1
is a hyperparameter. All other transitions yield no reward and are discounted by
the standard discount factor γ. For suitably large γ_B and γ, discounted reward
optimal strategies maximize the probability of satisfaction of the Büchi objective.
— Dense limit-reachability. The dense limit-reachability reward scheme [12]
connects the approaches of [11] and [3]. This reward scheme is identical to [11]
except that it gives a +1 reward every time an accepting transition is seen,
instead of only when the transition to the sink succeeds. Since discounting can
be thought of as a constant stopping probability [41], this reward scheme is the
same in expectation as a scaled version of [3].

— Parity. The parity reward scheme was proposed for stochastic games in [14].
For two-player games, it requires a GFG automaton. This translation utilizes a
deterministic parity automaton with a max odd objective. Transitions of priority
i go to a sink with probability ε^(k−i), where k is the number of priorities and
0 < ε < 1 is a hyperparameter. The transition to the sink receives a +1 or −1
reward for odd or even priorities, respectively. All other transitions receive a zero
reward. For sufficiently small ε, maximizing the cumulative reward results in a
strategy maximizing the probability of satisfaction of the parity objective.
— Priority tracker. The priority tracker reward scheme was proposed by Hahn et
al. [14]. For MDPs, Hahn et al. introduce a priority tracker gadget that takes a
parity objective with a hyperparameter 0 < ε < 1. The priority tracker consists
of two stages. In stage one, we wait for transients to end by ending the stage with
probability ε on each step. In the second stage, we detect the maximum priority
occurring infinitely often with a set of wait states, where we accept the current
maximum with probability ε on each step. For sufficiently small ε and large
discount γ, maximizing the discounted reward also maximizes the probability of
satisfaction of the parity objective.
— Lexicographic. Hahn et al. [19] proposed this reward scheme for lexicographic
ω-regular objectives. In this reward scheme, there is a tracker gadget that keeps
track of which accepting edges for the GFM Büchi automata have been seen.
When the tracker indicates that at least one accepting edge has been seen, the
learning agent can decide to “cash in” the tracker, which clears the tracker.
When this happens, with probability 1 − ζ the learning agent receives a reward
which is the weighted sum of seen accepting edges, scaled by powers of f, and
transitions to a terminating sink, where 0 < ζ < 1 and f > 1 are
hyperparameters. For suitable f, ζ, and γ, maximizing the discounted reward
yields the lexicographically optimal strategy.
— Average. The average reward scheme [23] translates absolute liveness ω-regular
objectives, which means the objective is concerned with eventual satisfaction,
to average reward for communicating MDPs. Given a GFM Büchi automaton,
transitions from every state in the automaton back to the initial state are
introduced, so-called “resets”. A hyperparameter c < 0 is introduced which gives
a penalizing reward to these resets. Accepting edges are then given a reward
of +1. Positional policies that maximize the average reward also maximize the
probability of satisfaction of the objective.
— Reward on accept. This reward scheme was proposed in [35]. The translation of
[35] picks a pair in a Rabin automaton to satisfy, and gives positive and negative
reward for the good and bad states of the pair, respectively. In general, picking
the winning pair ahead of time is not possible [11]. For a Büchi automaton, this
corresponds to giving positive (+1) rewards for accepting edges and zero rewards
otherwise. While this reward scheme was shown to be not faithful [11] for general
objectives, it is included for comparison purposes.
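The three Büchi-based schemes at the top of the list differ only in what happens on an accepting edge, which the following minimal sketch makes concrete. It follows the descriptions above and is not the tool's implementation; the function names and the `(reward, done)` and `(reward, discount)` return conventions are assumptions chosen for illustration.

```python
import random

def limit_reachability(accepting, zeta, rng):
    """Return (reward, done). An accepting edge moves to the sink with
    probability 1 - zeta and only then pays +1."""
    if accepting and rng.random() < 1.0 - zeta:
        return 1.0, True
    return 0.0, False

def dense_limit_reachability(accepting, zeta, rng):
    """Return (reward, done). Every accepting edge pays +1; the move to the
    sink still happens with probability 1 - zeta."""
    if not accepting:
        return 0.0, False
    return 1.0, rng.random() < 1.0 - zeta

def multi_discounted(accepting, gamma_b, gamma):
    """Return (reward, discount) to apply on this transition."""
    if accepting:
        return 1.0 - gamma_b, gamma_b
    return 0.0, gamma

rng = random.Random(0)
print(limit_reachability(True, 0.99, rng))
print(dense_limit_reachability(True, 0.99, rng))
print(multi_discounted(True, 0.99, 0.99999))
```

Keeping the scheme in a small function of this kind, separate from the learning agent, mirrors the separation of reward scheme and RL algorithm described in the text.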


3 Tool Design 


The primary design goal of Mungojerrie is to enable extensibility. To accomplish
this, Mungojerrie separates different processing stages as much as possible so that
extensions can reuse other components. We begin by presenting the architecture
of Mungojerrie. Afterwards, we take a closer look at the novel slim Büchi automata
plugin, which is described here in detail for the first time.

Fig. 5. Architecture of Mungojerrie 1.1.


Architecture of Mungojerrie. Mungojerrie begins its execution by parsing 
the input PRISM and HOA (see upper part of Fig. 5). The HOA is either read 
in from a file or piped from a call to one of the supported LTL translators. In 
particular the EPMC plugin from [13], an LTL translator capable of producing 
slim Büchi automata, is packaged with the tool. Requested automaton modifica- 
tions, such as determinization, are run after this step. If specified, Mungojerrie 
creates the synchronous product between the automaton and the model, and 
runs model checking or game solving [1,15,16]. The requested strategy and
values are returned. Due to this step, Mungojerrie has been connected to external
linear program solvers. This enabled the extension of Mungojerrie to compute
reward maximizing policies via a linear program for branching Markov decision
processes in [18].

If learning has been specified, the interpreter takes the automaton and model,
without explicitly forming the product, and provides an interface akin to OpenAI 
Gym [4] for the RL agent to interact with the environment and receive rewards. 
When learning is complete, the Q-table(s) can be saved to a file for later use, 
and the interpreter forms the Markov chain induced by the learned strategy and 
passes it to the internal model checker for verification. 
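The idea of interpreting the model and the automaton side by side, without building the product up front, can be sketched as a small Gym-style environment. Everything here is hypothetical: the class name, the callables it takes, and the choice to let the automaton read the label of the successor state are assumptions for illustration, not Mungojerrie's actual interface.

```python
class ProductEnv:
    """Steps a model and an automaton in lockstep, one transition at a time."""

    def __init__(self, model_step, label_of, automaton_step,
                 initial_state, initial_automaton_state):
        self.model_step = model_step            # (state, action) -> next state
        self.label_of = label_of                # state -> label read by automaton
        self.automaton_step = automaton_step    # (q, label) -> (next q, accepting?)
        self.initial = (initial_state, initial_automaton_state)
        self.state, self.automaton_state = self.initial

    def reset(self):
        """Return to the initial product state and report it."""
        self.state, self.automaton_state = self.initial
        return self.initial

    def step(self, action):
        """Advance both components; report the new pair and whether the
        automaton transition just taken was accepting."""
        next_state = self.model_step(self.state, action)
        next_q, accepting = self.automaton_step(
            self.automaton_state, self.label_of(next_state))
        self.state, self.automaton_state = next_state, next_q
        return (next_state, next_q), accepting

# Toy instance: a counter that saturates at 2, with "accepting" when it is there.
env = ProductEnv(
    model_step=lambda s, a: min(s + 1, 2),
    label_of=lambda s: s == 2,
    automaton_step=lambda q, label: (q, label),
    initial_state=0,
    initial_automaton_state=0,
)
print(env.reset())
print(env.step("inc"))
print(env.step("inc"))
```

A reward scheme would sit on top of such an interface and turn the `accepting` flag into rewards and termination signals.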


[Block diagram of the EPMC plugin. Nodes: LTL formula (1); HOA file (2);
translate (SPOT) (3); parse (4); NTLBA (5); construct SBA (6); construct
LDBA (7); SBA (8); LDBA (9); minimize LDBA (10); minimized LDBA (11);
construct simulation game (12); simulation game (13); game solver (14), with
outcomes "won" and "lost"; HOA file (15).]


Fig. 6. Automata generation block diagram 


Slim Büchi Automata Generation. For reward schemes involving LTL, the
translation to ω-automata is an important part of the design. Certain
automata may be more effective for learning than others. Slim Büchi automata
[13] were designed with learning considerations in mind. The translator that
produces these automata is packaged with Mungojerrie. We will now describe
its design in detail for the first time.

We have implemented slim Büchi automata generation as a plugin of the
probabilistic model checker EPMC [17]. The process is described in Fig. 6. The
starting point is a transition-labeled Büchi automaton in HOA format [2] (2)
or an LTL formula (1). In case we are given an automaton in HOA format, we
parse this automaton (4) and if we are given an LTL formula, we use the tool
SPOT [7] to transform the formula into an automaton (3). In both cases, we end
up with a transition-labeled Büchi automaton (5).

Afterwards, we have two options. The first option is to transform (6) this
automaton into a slim Büchi automaton (8) [13]. These automata can then be
directly composed with MDPs for model checking or used to produce rewards
for learning. The other option is to construct (7) a suitable limit-deterministic
Büchi automaton (SLDBA) (9). Automata of this type consist of an initial part
and a final part. A nondeterministic choice only occurs when moving from the
initial to the final part by an ε-transition (a transition without reading a
character). SLDBA can be directly composed with MDPs. However, SLDBA directly
constructed from general Büchi automata are often quite large, which in turn
also means that the product with MDPs would be quite large as well. Therefore,
we have implemented further optimization steps. We can apply a number of
algorithms to minimize (10) this automaton so as to achieve a smaller SLDBA
(11). To do so, we implemented several methods:


— Subsuming the states in the final part with an empty language 

— Signature-based strong bisimulation minimization in the final part 

— Signature-based strong bisimulation minimization in the initial part 

— Language-equivalence of states in the final part 

— If we have a state s in the initial part for which we find a state s’ in the final
part where the language of s and s’ are the same, we can remove all transitions
of s and add an ε-transition from s to s’ instead. Afterwards, automaton states
that cannot be reached anymore can be removed.


Each of these methods has a different potential for minimization as well as
runtime. We therefore allow the user to specify which optimizations are to be
used and in which order they are applied.
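Signature-based strong bisimulation minimization, which appears twice in the list above, can be sketched generically as partition refinement: states are repeatedly split according to the set of (label, accepting flag, target block) triples they can produce. The following is an illustrative stand-alone version on a transition-labeled automaton, not the plugin's code; the data layout is an assumption.

```python
def bisimulation_blocks(transitions):
    """Compute strong bisimulation classes by signature refinement.

    `transitions` maps each state to a set of (label, accepting, target)
    triples. Returns a dict from state to block number; states with the same
    number are bisimilar and can be merged.
    """
    block = {state: 0 for state in transitions}
    while True:
        signature = {
            state: frozenset((label, acc, block[target])
                             for label, acc, target in edges)
            for state, edges in transitions.items()
        }
        ids = {}
        refined = {
            state: ids.setdefault((block[state], signature[state]), len(ids))
            for state in transitions
        }
        # Refinement only ever splits blocks, so an unchanged count is a fixpoint.
        if len(ids) == len(set(block.values())):
            return refined
        block = refined

automaton = {
    "a": {("x", False, "a")},
    "b": {("x", False, "b")},
    "c": {("x", True, "c")},
}
blocks = bisimulation_blocks(automaton)
print(blocks)
```

In this toy automaton, states a and b end up in the same block because they behave identically, while c is separated by its accepting self-loop.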

Once we have optimized the SLDBA, we could directly use it for later compo- 
sition with an MDP. Another possibility is to prove that the original automaton 
is already good for MDPs. If this is the case, then it is often preferable to use 
the original automaton: being constructed by specialized tools such as SPOT, it 
is often smaller than the minimized SLDBA. The original automaton is good- 
for-MDPs if it simulates the SLDBA [13]. If it does, then it is also composable 
with MDPs. Otherwise, it is unknown whether it is suitable for MDPs. In this 
case, sometimes more complex notions of simulation can be used, but existing 
decision procedures are too expensive to implement [36]. 

To show simulation, we construct (12) a simulation game, which in our case
is a transition-labeled parity game (13) with 3 colors. We solve these games
using (a slight variation of) the McNaughton algorithm [28]. (We are aware
that specialized algorithms for parity games with 3 colors exist [9]. However, so
far the construction of the arena, not solving the game, turned out to be the
bottleneck here.) If the even player is winning, the simulation holds. Otherwise,
more complex notions of simulation can be used, which however lead to larger
parity games being constructed. In case the even player is winning for any of
them, we can use the original automaton; otherwise we have to use the SLDBA.
In any case, we export the result to an HOA file (15). For illustration and
debugging, automata and simulation games can be exported to the GraphViz [8]
format.
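For orientation, the classical recursive algorithm in the McNaughton/Zielonka style can be sketched as follows. Note the simplifications: priorities are attached to nodes rather than transitions, the winner is decided by the parity of the maximal recurring priority (even wins for player 0), and every node is assumed to have at least one successor. This is the textbook algorithm, not the "slight variation" used in the plugin.

```python
def attractor(nodes, owner, succ, target, player):
    """Nodes of the subgame `nodes` from which `player` can force reaching `target`."""
    attr = set(target)
    changed = True
    while changed:
        changed = False
        for v in nodes - attr:
            successors = [w for w in succ[v] if w in nodes]
            if owner[v] == player:
                forced = any(w in attr for w in successors)
            else:
                forced = bool(successors) and all(w in attr for w in successors)
            if forced:
                attr.add(v)
                changed = True
    return attr

def solve_parity(nodes, owner, priority, succ):
    """Return (win0, win1): the winning regions of player 0 (even) and player 1 (odd)."""
    if not nodes:
        return set(), set()
    top = max(priority[v] for v in nodes)
    player, opponent = top % 2, 1 - top % 2
    top_nodes = {v for v in nodes if priority[v] == top}
    a = attractor(nodes, owner, succ, top_nodes, player)
    sub = solve_parity(nodes - a, owner, priority, succ)
    result = [set(), set()]
    if not sub[opponent]:
        result[player] = set(nodes)      # player wins everywhere in this subgame
        return tuple(result)
    b = attractor(nodes, owner, succ, sub[opponent], opponent)
    rest = solve_parity(nodes - b, owner, priority, succ)
    result[player] = rest[player]
    result[opponent] = rest[opponent] | b
    return tuple(result)

# Node 0 (even player, priority 2) may loop or move to node 1 (priority 1, self-loop).
win_even, win_odd = solve_parity(
    nodes={0, 1},
    owner={0: 0, 1: 1},
    priority={0: 2, 1: 1},
    succ={0: [0, 1], 1: [1]},
)
print(win_even, win_odd)
```

With only 3 colors the recursion depth stays small, which is consistent with the remark that arena construction rather than solving is the bottleneck.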


4 Case Studies 


To showcase how Mungojerrie can be used to experiment with different reward
schemes, we provide three case studies. In the first case study, we demonstrate
how Mungojerrie can be used to compare the effectiveness of two different
reward schemes on the same system. In the second case study, we consider the
design space of automata, and demonstrate how Mungojerrie can be used to
compare how different ω-automata change learning effectiveness. This is
important for considering how to design LTL translators that produce automata
that are effective for learning. In the last case study, we demonstrate how the
different outputs of Mungojerrie can be used. For additional experimental results
obtained using Mungojerrie, we refer readers to [11,12,14,19,39,45,23] for case
studies testing ω-regular reward schemes, and [13] for the EPMC plugin. We
also refer readers to [26, Fig. 3] which examined RL for scLTL properties, [6] for
continuous-time MDPs, and [18], which extended Mungojerrie to test model-free
reinforcement learning in branching Markov decision processes.


4.1 Comparing Reward Schemes 


To demonstrate how Mungojerrie may be used to compare reward schemes, we
compare the reward scheme of [11] with a modification of it that assigns a +1
reward on every accepting edge, as introduced in [12]. We compare these two
methods on the same problem, where the learner must safely navigate two robots
on a slippery gridworld to a goal. We also fix the problem parameters ζ = 0.99
and γ = 0.99999, and the use of Q-learning. Since we are interested in which
method will converge sooner, we fix the amount of training to be relatively low.
We allow the two parameters specific to Q-learning, the learning rate α and the
exploration rate ε, to be varied in order to find the optimal combination for
each method. We average 10 runs for each grid point. This required 32000 runs,
which took approximately 79 CPU hours (single-core) on a 2.5GHz Intel Xeon
E5-2680 v3. This corresponds to an average of approximately 188000 sampled
transitions per second per core, including model checking time. This sampling
rate is typical of what was observed in other experiments.
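The reported figures can be combined into a rough per-run budget. The short calculation below uses only the approximate numbers quoted above, so its results are order-of-magnitude estimates and not measurements reported by the authors.

```python
cpu_hours = 79                    # total single-core CPU time (approximate)
transitions_per_second = 188_000  # average sampling rate (approximate)
runs = 32_000                     # total number of learning runs
runs_per_grid_point = 10

total_transitions = cpu_hours * 3600 * transitions_per_second
transitions_per_run = total_transitions / runs
grid_points = runs // runs_per_grid_point

print(total_transitions, transitions_per_run, grid_points)
```

Under these figures each run samples on the order of 1.7 million transitions, and the 32000 runs correspond to 3200 grid points across the two reward schemes.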

Figure 7 shows the probability of satisfaction of the learned strategy as
computed by the model checker of Mungojerrie. One can see that under these
conditions, the reward scheme from [12] is able to consistently learn probability
1 strategies under certain parameter combinations, while [11] does not.
Figure 8 shows the difference between the estimated probability of satisfaction,
found by taking the value from the initial state of the Q-table and renormalizing
it appropriately, and the probability of satisfaction of the learned strategy
computed by the model checker of Mungojerrie. One can see that the reward scheme
of [11] sometimes overestimates and sometimes underestimates when it achieves a
high actual probability of satisfaction under these conditions. However, on the
same example, the reward scheme of [12] consistently underestimates everywhere.
In summary, Mungojerrie allowed us to see that, although the reachability reward
scheme of [12] may achieve higher probabilities of satisfaction sooner, it may
take longer for the values in the Q-table to properly converge.

Fig. 7. Probability of satisfaction of learned strategies as computed by the model
checker of Mungojerrie. ‘Hahn et al. 19’ refers to the translation of [11]. ‘Hahn et
al. 20’ refers to the translation of [12] that assigns +1 reward on every accepting edge
with reachability parameter ζ. Each grid point is the average of 10 runs.


4.2 Comparing Automata 


An ω-regular objective may be described by different automata, many of which
may be good-for-MDPs. Mungojerrie can be used to compare the effectiveness
of such automata when used in RL. Consider the two nondeterministic Büchi
automata shown in Fig. 9. Both are equivalent to the LTL formula (FG x) ∨
(GF y), but the one on the right should be better for learning: long transient
sequences of observations that satisfy x ∧ ¬y may convince the agent to commit
to State 1 of the left automaton too soon.

To test this conjecture, we specified a model in PRISM organized in two long
chains. In one of them the agent sees many x's for a while, but eventually only
sees y's. In the other chain the situation is reversed. Which chain is followed is
up to chance. We then used the reward scheme from [3] with Q-learning under the
default hyperparameters in Mungojerrie, γ_B = 0.99, γ = 0.99999, α = 0.1, and
ε = 0.1. We then trained for 20000 episodes under each automaton, and used
Mungojerrie to compute the probability of satisfaction of the property at periodic
intervals. Since learning to control the left automaton requires thorough and deep
exploration, we conjectured that optimistic initialization of the Q-table [41] to
the value 0.8 will improve performance. We took the average of 1000 runs for
each combination.

Fig. 8. Estimated probability of satisfaction of learned strategies minus the probability
of satisfaction computed by the model checker of Mungojerrie. Blue indicates
underestimation, while red indicates overestimation. Hahn et al. 19 refers to the translation
of [11]. Hahn et al. 20 refers to the translation of [12] that assigns +1 reward on every
accepting edge with reachability parameter ζ. Each grid point is the average of 10 runs.

Figure 10 shows the resulting curve. When using the LDBA without
optimistic initialization, the learning agent is unable to learn the optimal strategy
under these conditions. While it is worth noting that using the LDBA
without optimistic initialization eventually converges to the optimal strategy with
enough training, it is clear that the choice of the automaton can have a
significant impact on learning performance. Therefore, the design of translations
from LTL to automata has a role to play in producing effective reward schemes.
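Optimistic initialization, as used in this experiment, simply means that untried state-action pairs start with a high value estimate so that a greedy learner is drawn towards them. A minimal sketch follows, assuming a dictionary-based Q-table and ε-greedy action selection; the helper names are hypothetical and the value 0.8 is the one quoted in the text.

```python
import random
from collections import defaultdict

def make_q_table(initial_value=0.8):
    """Q-table whose unseen entries default to an optimistic initial value."""
    return defaultdict(lambda: initial_value)

def epsilon_greedy(q, state, actions, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda action: q[(state, action)])

q = make_q_table()
q[("s", "tried")] = 0.1   # an action whose estimate has already dropped
choice = epsilon_greedy(q, "s", ["tried", "untried"], epsilon=0.0,
                        rng=random.Random(0))
print(choice)
```

With a zero-initialized table the greedy choice would have no reason to leave the already explored action, which is the effect the experiment is probing.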



Fig. 9. Equivalent, but not equally effective, Büchi automata. “LDBA” and
“Forgiving” refer to the automaton on the left and right, respectively.

Fig. 10. Plot of the evolution of the probability of satisfaction of learned strategies as
computed by the model checker of Mungojerrie. “LDBA” and “Forgiving” refer to the
left and right automata in Figure 9, respectively. “(optimistic)” indicates optimistic
initialization of the Q-table was used. Each curve is the average of 1000 runs.

Fig. 11. A grid-world stochastic game arena (left) and a deterministic parity
automaton for the objective (right).


4.3 A Game of Pursuit 


Figure 11 describes a stochastic parity game of pursuit in which the Max player
(M) tries to escape from the Min player (m). At each round, each player in turn
chooses a direction to move. If movement in that direction is not obstructed
by a wall, then the player moves either two squares or one square with equal
probabilities. One square of the grid is a trap, which m must avoid at all times,
but M may visit finitely many times. Player M should be at least 5 squares away
from player m infinitely often. This objective is described by the LTL property
(F trap_mn) ∨ ((FG ¬trap_mx) ∧ (GF ¬close)), where trap_mn and trap_mx are
true when m and M visit the trap square, respectively, and close is true when
the Manhattan distance between the two players is less than 5 squares. This
objective translates to the deterministic parity automaton in Fig. 11, which
accepts a word if the maximum recurring priority of its run is odd.
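The atomic propositions of this objective are simple functions of the two players' positions. As a minimal sketch, with positions represented as (row, column) pairs and proposition names chosen to match the formula above (both of which are assumptions about representation, not the model's actual encoding):

```python
def manhattan(a, b):
    """Manhattan distance between two grid squares given as (row, column)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def atomic_propositions(max_pos, min_pos, trap_pos):
    """Truth values of the propositions used in the pursuit objective."""
    return {
        "close": manhattan(max_pos, min_pos) < 5,
        "trap_mx": max_pos == trap_pos,   # Max player is on the trap
        "trap_mn": min_pos == trap_pos,   # Min player is on the trap
    }

labels = atomic_propositions(max_pos=(0, 0), min_pos=(3, 2), trap_pos=(3, 2))
print(labels)
```

Here the players are exactly 5 squares apart, so `close` is false, and the Min player stands on the trap.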

Unlike the example of Fig. 2, inspection of the Markov chain induced by
an optimal strategy and manual verification of the optimality of the learned
strategy is impractical. Instead, the model checker of Mungojerrie has verified the
optimality of this strategy from the initial state. For visualization, Mungojerrie
can also save the strategy in CSV format. Postprocessing can then produce a
graphical representation like the one of Fig. 12. The color gradient shows that,
in the main, M’s strategy is to move away from m.

Fig. 12. Max player learned strategy for the game of Fig. 11 when the automaton is
in State 0. (Any strategy will do when the automaton is in State 1.) In each 6 x 6 box
the rose-colored square is the position of the minimizing player, while the light-blue
square marks the trap.


542 E. M. Hahn et al.


5 Conclusion


We have introduced Mungojerrie, an extensible tool for experimenting with
reward schemes for RL, with a focus on ω-regular objectives. Mungojerrie allows
the specification of models in PRISM [25] and ω-automata in HOA [2].
Multiple LTL translators can be called from the tool [7,24], including the EPMC
plugin introduced in [13] for the construction of slim Büchi automata.
Mungojerrie includes various reward schemes [11,3,12,14,19,23,35] for ω-regular
objectives and model-free RL algorithms [43,20,40,23]. Mungojerrie also includes
an internal probabilistic model checker for the verification of learned strategies
against ω-regular objectives, and for allowing users to verify that developed
examples are as intended. The tool also comes packaged with benchmarks for
ω-regular objectives in RL.

We have discussed Mungojerrie’s design and demonstrated how Mungojerrie
can be used to perform comparisons of reward schemes for ω-regular objectives.
The source and documentation of Mungojerrie are publicly available.

Abstract. CHERI-C extends the C programming language by adding
hardware capabilities, ensuring a certain degree of memory safety while 
remaining efficient. Capabilities can also be employed for higher-level se- 
curity measures, such as software compartmentalization, that have to be 
used correctly to achieve the desired security guarantees. As the exten- 
sion changes the semantics of C, new theories and tooling are required 
to reason about CHERI-C code and verify correctness. In this work, we 
present a formal memory model that provides a memory semantics for 
CHERI-C programs. We present a generalised theory with rich proper- 
ties suitable for verification and potentially other types of analyses. Our 
theory is backed by an Isabelle/HOL formalisation that also generates 
an OCaml executable instance of the memory model. The verified and 
extracted code is then used to instantiate the parametric Gillian pro- 
gram analysis framework, with which we can perform concrete execution 
of CHERI-C programs. The tool can run a CHERI-C test suite, demon- 
strating the correctness of our tool, and catch a good class of safety 
violations that the CHERI hardware might miss. 


Keywords: CHERI-C · Hardware Capabilities · Memory Model · Semantics ·
Theorem Proving · Verification


1 Introduction 


Despite having been developed more than 40 years ago, C remains a widely used 
programming language owing to its efficiency, portability, and suitability for low- 
level systems code. The language’s lack of inherent memory safety, however, has 
been the source of many serious issues [18]. While there have been significant ef- 
forts aimed at vulnerability mitigation, memory safety issues remain widespread, 
with a recent study stating that 70% of security vulnerabilities are caused by 
memory safety issues [31]. 

The Capability Hardware Enhanced RISC Instructions (CHERI) project of- 
fers an alternative model that provides better memory safety [44]. Its main fea- 
tures include a new machine representation of C pointers called capabilities and 
extensions to existing Instruction Set Architectures (ISA) that enable the se- 
cure manipulation of capabilities. Capabilities are in essence memory addresses 
bound to additional safety-related metadata, such as access permissions and 
bounds on the memory locations that can be accessed. As the hardware performs 
the safety checks on capabilities, legacy C programs compiled and run on a 
CHERI architecture, i.e. CHERI-C code, acquire hardware-ensured spatial 
memory safety while retaining efficiency. Porting code from one language to 
another generally requires significant effort, but porting C code to CHERI-C 
requires little, if any, change to the original code to ensure that it runs on 
CHERI hardware [36,39]. 

(a) CHERI-256 Capability Layout: tag :: 1 bit; reserved; perm :: 31 bits; 
    length :: 64 bits; base :: 64 bits; addr :: 64 bits 
(b) CHERI-128 Capability Layout: tag :: 1 bit; perm :: 15 bits; reserved; 
    bounds :: 41 bits; addr :: 64 bits 

Fig. 1: Simplified CHERI Capability Layouts 

In 2019, the UK announced its Digital Security by Design programme with 
£190 million of funding distributed over more than 26 research projects and 5 
industrial demonstrators [6] to ‘radically update the foundation of our insecure 
digital computing infrastructure, by demonstrating that mainstream processor 
technology ... can be updated to include new security technologies based on 
the CHERI Architecture’ [5]. A cornerstone of the programme is Morello [4], a 
CHERI-enabled prototype developed by Arm. 

Over the several years that led to the realisation of Morello, there were 
several design revisions made to the hardware; examples are depicted in Fig. 1. 
The refined designs used methods for compression of bounds that reduced cache 
footprints and improved overall performance while minimising incompatibil- 
ity. Morello uses a very similar design to the compressed scheme for capa- 
bilities depicted in Fig. 1b, with the overall bit-representation of the layout 
differing slightly. Future capability designs may incorporate a different 
bit-representation design, provided there are improvements in performance or 
compatibility. Due to the ever-changing design of capability bit-representations, 
it seems best to have an abstract representation of capabilities, so that CHERI- 
based verification tools can remain modular. 

Checking for memory safety issues of legacy C code can, of course, be achieved 
using existing analysis tools for C, but there are new problems that arise when 
such code is run on CHERI hardware. Because the pointer and memory represen- 
tations are fundamentally different in a CHERI architecture, there are non-trivial 
differences in the semantics between C and CHERI-C. 

To illustrate this point, consider the C code in Listing 1.1. This code segment 
performs memcpy twice: once from a to b, where pointers/capabilities are stored 
misaligned in b, then from b to c, where pointers/capabilities are stored correctly 
again in c. In standard C, there are no problems accessing the pointer stored 
in c. But in CHERI-C, misaligned capabilities in memory are invalidated. That 
means the address and meta-data of the misaligned capabilities are accessible, 
but such capabilities can no longer be dereferenced [41]. While c will contain the 
same capability value as that of a, the capability stored in c is invalidated. Thus, 
the final statement (int x = **c;) will trigger an ‘invalid tag’ exception when the 
code is executed on Arm Morello and other CHERI-based machines. 


    #include <stdlib.h> 
    #include <string.h> 
    void main(void) { 
      int *n = calloc(sizeof(int), 1); 
      int **a = malloc(sizeof(int *)); 
      *a = n; 
      int **b = malloc(sizeof(int *) * 2); 
      int **c = malloc(sizeof(int *)); 
      memcpy((char *) b + 1, a, sizeof(int *)); 
      memcpy(c, (char *) b + 1, sizeof(int *)); 
      int x = **c; 
    } 


Listing 1.1: C code example 


Existing C analysis tools cannot catch such cases: they are unaware of the 
semantic changes that capabilities bring, and the code is not problematic in 
conventional C. Moreover, while CHERI hardware ensures spatial safety, it 
cannot catch temporal safety violations, such as Use After Free (UAF) 
violations. There is work that attempts to address temporal safety [11,17,42], 
but it is either a software-implemented solution [42], where overall performance 
is inevitably affected, or ongoing work [11]. There is, therefore, a need for 
program analysis tools that correctly integrate the semantics of CHERI-C. 

To the best of our knowledge, there is no prior work on formalising a CHERI- 
C memory model. The Cerberus C work [30] is primarily designed to capture 
pointer provenance of C programs and uses CHERI-C as a reference for pointer 
provenance, but the tool lacks a formal CHERI-C memory model. ESBMC is 
a verification tool that supports CHERI-C code [15]. But support for tagged 
memory does not yet exist; ESBMC would not be able to catch the ‘invalid tag’ 
exception in the code in Listing 1.1. Furthermore, ESBMC’s memory model is 
not formally verified. Users of ESBMC must trust that the implementation of 
the memory model and its underlying theory are correct. SAIL formalisations 
for each CHERI architecture exist [3,8,9], but they only capture the low-level 
semantics of the architecture and not high-level C constructs such as allocation. 

In this paper, we introduce a formal CHERI-C memory model that captures 
the memory semantics of the CHERI-C language. In Sect. 3, we formalise the 
memory and its operations and prove essential properties that provide 
correctness guarantees. We provide a rigorous logical formalisation of the 
CHERI-C memory model in Isabelle/HOL [32] (in Sect. 4.1) and use the code 
generation feature to generate a verified OCaml instance of the memory model 
[21]. We then show, in Sect. 4.2, the practical aspects of this work by providing 
the memory model to, and thereby instantiating, Gillian [20], a general, 
parametric verification framework that supports concrete and symbolic 
execution and verification based on separation logic, backed by rich correctness 
properties. In Sect. 5, we demonstrate that the tool can capture the semantics of 
CHERI-C programs correctly. Sect. 6 discusses existing work, and Sect. 7 
concludes the paper and mentions possible future directions. We start with an 
introduction to the CHERI architecture. 


2 CHERI 


CHERI extends a conventional ISA by introducing capabilities which are essen- 
tially pointers that come along with metadata to restrict memory access. The 
ISA now has additional hardware instructions and exceptions that operate over 
capabilities. Register sets are extended to include capability registers, 
instructions are added that reference the capability registers, and custom 
hardware exceptions are added to block operations that would violate memory 
safety. Designs of CHERI capabilities have been refined over the past several 
years and have 
been incorporated in several existing architectures, such as MIPS and RISC- 
V [40]. All CHERI-extended ISAs have been formally defined using the SAIL 
specification language, in which the logic of machine instructions and memory 
layout have been defined formally in a first-order language [13]. 

Regardless of the layout, CHERI capabilities include three important types 
of high-level information, in addition to a 64-bit address: 


— Permissions. Permissions state what kind of operations a capability can 
perform. Loading from memory and storing to memory are examples of per- 
missions a capability may possess. 

— Bounds. Bounds stipulate the memory region that the address part of a 
capability can reference. The lower bound stipulates the lowest address that 
a capability may access, and the upper bound stipulates the highest address. 

— Tag. Stored separately from the other components of a capability, the tag 
states the validity of the capability it is attached to. Capabilities with invalid 
tags can hold data but cannot be dereferenced. Attempts to forge capabilities 
out of thin air result in a tag-invalidated capability. 


Fig. 1a shows a 256-bit representation of a capability, which was one of the 
earlier designs. The lower and upper bounds are represented using the base and 
length fields. Here, the lower bound is the address stated by the base field, and 
the upper bound is the address in the base field plus the length field. Permis- 
sions and other metadata are stored in the remaining fields as a bit vector. The 
capability’s tag bit exists separately from the capability. Tag bits are, in prac- 
tice, stored separately from the main memory where capabilities reside, so users 
cannot manipulate the tag bits of capabilities stored in memory. Furthermore, 
overwriting capabilities stored in memory with non-capability values invalidates 
their tag bits, which ensures capabilities cannot be forged out of thin air. 

This representation, in theory, offers a high level of compatibility with 
existing C code. But performance, particularly with regard to caching, is reduced 
due to the size of the capability representation [43]. Refined designs ultimately 
resulted in a capability that utilises a floating-point-based lossy compression 
technique on the bounds [43], such as the one depicted in Fig. 1b. In many 
cases, the upper bits of the address fields are most likely to overlap with those 
of the lower and upper bounds. Knowing this, bounds can be compressed by 
having the upper bits of their fields depend on that of the address, which means 
only the lower bits need to be stored. 

The lossy compression of bounds may result in some incompatibility. Bounds 
may no longer be represented exactly, and changes in the address field may 
result in an unintentional change in the bounds. Nonetheless, such 
representations give an acceptable level of compatibility, provided aggressive 
pointer arithmetic optimisations are avoided. The Morello processor incorporates 
a similar compression-based design in its architecture, though the size of each 
field differs [12]. 

The added capability-aware instructions operate over capabilities. Conven- 
tional load and store operations are extended to first check that the tag, permis- 
sions, and bounds of the capability are all valid. Violations result in triggering a 
capability-related hardware exception. There are additional operations to access 
or change the tag, permissions, and bounds. To ensure spatial memory safety, 
these operations can, at most, make the conditions for execution more restric- 
tive; they cannot grant that which was not previously available. For instance, one 
cannot lower the lower bound of a capability to access a region that was inacces- 
sible before, or grant a store permission that was unset beforehand. Because of 
how tags work for capabilities stored in memory, one cannot grant capabilities 
larger bounds or more permissions by manipulating the memory—attempting 
this results in tag invalidation. 
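
The checks and the monotonicity rule described above can be illustrated with a small Python sketch. It is not the SAIL semantics: the field names, the permission strings, the violation names other than TagViolation, and the choice to clear the tag on an attempted widening (rather than raise a hardware exception) are assumptions made only for illustration. The upper bound is treated as exclusive.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Capability:
    tag: bool          # validity bit, stored out of band on real hardware
    perms: frozenset   # e.g. {"load", "store"}; names are illustrative
    base: int          # lower bound
    length: int        # upper bound is base + length (exclusive in this sketch)
    addr: int          # address the capability currently points to

def check_access(cap, perm, size):
    """Return None if the access is allowed, else the name of the violation."""
    if not cap.tag:                      # the tag is checked first
        return "TagViolation"
    if perm not in cap.perms:
        return "PermissionViolation"
    if cap.addr < cap.base or cap.addr + size > cap.base + cap.length:
        return "BoundsViolation"
    return None

def restrict_perms(cap, perms):
    # Intersection can only drop permissions, never add new ones.
    return replace(cap, perms=cap.perms & frozenset(perms))

def set_bounds(cap, base, length):
    # Narrowing keeps the tag; any attempt to widen clears it in this sketch.
    widened = base < cap.base or base + length > cap.base + cap.length
    return replace(cap, base=base, length=length, tag=cap.tag and not widened)

cap = Capability(True, frozenset({"load", "store"}), 0x1000, 16, 0x1000)
read_only = restrict_perms(cap, {"load"})
print(check_access(read_only, "store", 4))                    # PermissionViolation
print(check_access(set_bounds(cap, 0x0FF0, 64), "load", 4))   # TagViolation
```

Because both mutators can only shrink what a capability grants, no sequence of calls starting from cap yields a tagged capability that reaches outside the original 16-byte region.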

Library support for CHERI has grown over the past few years. In particular, 
a software stack for CHERI-C that utilises a custom Clang compiler now exists 
[41]. Users can compile their program either in ‘purecap’ mode, where all pointers 
in programs are replaced with capabilities, or in ‘hybrid’ mode, where both 
pointers and capabilities co-exist within the program. Because operations that 
change the fields of a capability do not generally exist in standard C, Clang 
incorporates additional CHERI libraries of operations that users may use to 
access or mutate capabilities. 


3 CHERI-C Memory Model 


Incorporating hardware-enabled spatial safety requires significant changes to the 
C memory model. Pointer designs must be extended to incorporate bounds, 
metadata, and the out-of-band tag bit. The memory, i.e. heap, must also be able 
to distinguish the main memory and the tagged memory. Operations with respect 
to the heap must also be defined such that tag preservation and invalidation are 
incorporated appropriately. 

In this section, we provide a generalised theory for the CHERI-C memory 
model. We identify the type and value system used by the memory model. We 
then define the heap and the core memory operations. Finally, we state some 
essential properties of the heap and the operations that (1) characterise the 
semantics and (2) state what types of verification or analyses could be supported. 
We assume a ‘purecap’ environment, where all pointers have been replaced with 
capabilities. 


3.1 Design 


The CHERI-C memory model is inspired by that of CompCert [26]. The beauty 
of CompCert is that it is a verified C compiler. The internal components, which 
include the block-offset based memory model, are formalised in a theorem prover, 
with many of its essential properties verified. Using CompCert’s memory model 
as a basis, we design the CHERI-C memory model by providing extensions to 
ensure the modelling of correct semantics and the capture of safety violations: 


— Capability Values. In addition to the standard primitive types, we incor- 
porate abstract capabilities as values. We also incorporate capability frag- 
ments to provide semantics to higher-level memory actions like memcpy, 
which should preserve tags if copied correctly and invalidate otherwise [41]. 

— Extended Operations. Basic memory actions such as load and store 
now work on capabilities and will trigger the correct capability-related ex- 
ception when required. 

— Tagged Memory. Tags in memory are stored separately from the main 
heap, as can be seen in the formal CHERI-MIPS SAIL model [9]. So we 
provide a separate mapping for tagged memory for storing capability tags. 

— Freed Regions. The standard CompCert memory model can mark which 
memory regions are valid but lacks the ability to distinguish which regions 
are marked as ‘Freed’. We incorporate freed regions as a means to catch 
temporal safety violations. 


3.2 Type and Value System 


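Figure 2 shows the formalisation of CHERI-C types and values. Types $\tau$ are 
analogous to chunks in CompCert terms. Types comprise primitive types (e.g. 
$\mathrm{U8}_\tau$, $\mathrm{S64}_\tau$, etc.) and a capability type 
$\mathrm{Cap}_\tau$. We define a function $|\cdot| : \tau \to \mathbb{N}$ that 
returns, in terms of bytes, the size of the type. For $\mathrm{Cap}_\tau$, the 
value is not fixed but must be divisible by 16. This requirement allows 
capabilities with 128- and 256-bit representations to have a valid size. 

$$
\begin{array}{rcl}
\tau & \triangleq & \mathrm{U8}_\tau \mid \mathrm{S8}_\tau \mid \dots \mid \mathrm{U64}_\tau \mid \mathrm{S64}_\tau \mid \mathrm{Cap}_\tau \\
\mathrm{MCap} & \triangleq & \mathcal{B} \times \mathbb{Z} \times \mathit{md} \\
\mathrm{Cap} & \triangleq & \mathrm{MCap} \times \mathbb{B} \\
\mathcal{V}_C & \triangleq & \mathrm{U8}_{\mathcal{V}} :: 8\ \text{bits} \mid \dots \mid \mathrm{S64}_{\mathcal{V}} :: 64\ \text{sbits} \mid \mathrm{Cap}_{\mathcal{V}} :: \mathrm{Cap} \mid \mathrm{CapF}_{\mathcal{V}} :: \mathrm{Cap} \times \mathbb{N} \mid \mathrm{Undef} \\
\mathcal{V}_M & \triangleq & \mathrm{Byte} :: 8\ \text{bits} \mid \mathrm{MCapF} :: \mathrm{MCap} \times \mathbb{N}
\end{array}
$$

Fig. 2: CHERI-C Types and Values 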

MCap represents a memory capability value and is represented as a tuple 
$(b, i, m)$, which comprises the block identifier $b \in \mathcal{B}$, offset 
$i \in \mathbb{Z}$, and metadata $m \in \mathit{md}$, where $\mathit{md}$ 
represents the bounds and permissions. Here, $\mathcal{B}$ must be a countable 
set. Offsets are represented as integers, as CHERI allows out-of-bounds 
addresses, where the address may be lower than the lower bound. Because 
capabilities stored in memory have their tag bit stored elsewhere, we make the 
distinction between memory capabilities and tagged capabilities Cap, where a 
tagged capability $((b, i, m), t)$ additionally contains the tag bit 
$t \in \mathbb{B}$. 

Unlike those of CompCert, CHERI-C values $\mathcal{V}_C$ are given type 
distinctions to ensure that (1) types can be inferred directly, and (2) they 
contain the correct values at all times. From a practical standpoint, this 
simplifies the proof of correctness of memory operations and allows bounded 
arithmetic operations to be implemented correctly. Capability values 
$\mathrm{Cap}_{\mathcal{V}}$ and capability fragment values 
$\mathrm{CapF}_{\mathcal{V}}$ also exist as values. Given some capability value 
$C \in \mathrm{Cap}_{\mathcal{V}}$, a capability fragment value 
$C_n \in \mathrm{CapF}_{\mathcal{V}}$ corresponds to the $n$-th byte of the 
capability $C$. For both cases, instead of fixing their representation 
concretely, we represent them abstractly using a tuple. This representation 
ensures that conversion to a compressed representation could be achieved when 
needed while avoiding the need to commit to one particular bit representation. 
Furthermore, this approach provides a reasonable way to correctly define memcpy, 
where capability tags must be preserved if possible. While capability fragments 
are extended structures of capabilities, operations that can be performed on 
capability fragments are limited. Finally, we have Undef, which represents 
invalid values. These values may appear when, for example, the user calls malloc 
and immediately tries to load the undefined contents. The idea of incorporating 
capability fragment values is heavily inspired by [25]. 

Because values are given a type distinction, identifying the types of values is 
straightforward. For capability fragments, we have two choices: they may either 
be of type $\mathrm{U8}_\tau$ or $\mathrm{S8}_\tau$. Capability fragments are 
essentially bytes, so operations over capability fragments can be treated as if 
they were of type $\mathrm{U8}_\tau$ or $\mathrm{S8}_\tau$. Since Undef does not 
correspond to a valid value, it is not assigned a type. 


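$$
\begin{array}{rcl}
\mathrm{CapErr} & \triangleq & \mathrm{TagViolation} \mid \mathrm{PermitLoadViolation} \mid \dots \\
\mathrm{LogicErr} & \triangleq & \mathrm{UseAfterFree} \mid \mathrm{MissingResource} \mid \dots \\
\mathrm{Err} & \triangleq & \mathrm{CapErr} \mid \mathrm{LogicErr} \\
\mathcal{R}\ \rho & \triangleq & \mathrm{Succ}\ \rho \mid \mathrm{Fail}\ \mathrm{Err}
\end{array}
$$

Fig. 3: CHERI-C Errors 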
Memory operations, such as load and store, are defined so that, upon 
failure, the operation returns the type of error that led to the failure. In 
general, partial functions, or functions using the option type, can model 
function failure but cannot state what caused the failure. As such, the 
operations use the return type $\mathcal{R}\ \rho$, where $\rho$ is a generic 
return type. For CHERI-C, we make the distinction between errors caused by 
capabilities, denoted by CapErr, and errors caused by the language, denoted by 
LogicErr. Figure 3 depicts the formalised error system used by the memory model. 


3.3 Memory 


We now formalise the memory. We use CompCert’s approach of using a union 
type $\mathcal{V}_M$ that can represent either a byte or a byte fragment of a 
memory capability. Then it is possible to create a memory mapping 
$\mathbb{N} \rightharpoonup \mathcal{V}_M$.¹ We also create a separate mapping 
of type $\mathbb{N} \rightharpoonup \mathbb{B}$ for tagged memory. When the user 
attempts to store a capability, it will be converted into a memory capability 
and then stored in the memory mapping. Separately, the tag bit will be stored in 
the tagged memory. When the tag bit is stored, adjustments are made to ensure 
tags are only stored in capability-size-aligned offsets. 

To ensure we can catch temporal safety violations, we need to be able to 
distinguish between blocks that are freed and blocks that are valid. One way to 
encode this is as follows: a block $b$ may point to either a freed location 
(i.e. $b \mapsto \emptyset$) or to the pair of maps we defined earlier. The idea 
is that if a block identifier points to a freed block, attempts to load from it 
will trigger a ‘Use After Free’ violation; otherwise, the identifier points to a 
valid mapping pair. Ultimately, the heap has the following form: 


$$\mathcal{H} : \mathcal{B} \rightharpoonup ((\mathbb{N} \rightharpoonup \mathcal{V}_M) \times (\mathbb{N} \rightharpoonup \mathbb{B}))_{\emptyset}$$
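
To make the shape of the heap concrete, the following Python sketch models it with dictionaries. It is only an illustration, not the Isabelle/HOL formalisation: the tuple encoding of results and cells, the FREED marker, and the choice of MissingResource for an unknown block are assumptions.

```python
FREED = None  # stands in for the freed marker of a block

def access_block(heap, b):
    """Look up block b, telling apart unknown, freed, and live blocks."""
    if b not in heap:
        return ("Fail", "MissingResource")   # assumed error for unknown blocks
    if heap[b] is FREED:
        return ("Fail", "UseAfterFree")
    return ("Succ", heap[b])

# heap : block id -> FREED | (content map, tag map)
heap = {
    0: ({0: ("Byte", 0x2A)}, {}),   # live block holding one byte
    1: FREED,                       # block that has been freed
}

print(access_block(heap, 0)[0])   # Succ
print(access_block(heap, 1))      # ('Fail', 'UseAfterFree')
print(access_block(heap, 2))      # ('Fail', 'MissingResource')
```

Keeping freed blocks in the map, rather than deleting them, is what lets a later access be reported as a temporal violation instead of as a generic missing block.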


3.4 Operations 


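We define the core memory operations, or actions, of the memory model. Instead 
of using partial functions, we use the result type $\mathcal{R}$ given in Fig. 3 
so that an operation can report the type of error should it fail. 

The memory actions $\mathcal{A}_C = \{\mathrm{alloc}, \mathrm{free}, \mathrm{load}, \mathrm{store}\}$ 
are given below with their respective signatures: 


— alloc : $\mathcal{H} \to \mathbb{N} \to \mathcal{R}\ (\mathcal{H} \times \mathrm{Cap})$ 
— free : $\mathcal{H} \to \mathrm{Cap} \to \mathcal{R}\ (\mathcal{H} \times \mathrm{Cap})$ 
— load : $\mathcal{H} \to \mathrm{Cap} \to \tau \to \mathcal{R}\ (\mathcal{V}_C)$ 
— store : $\mathcal{H} \to \mathrm{Cap} \to \mathcal{V}_C \to \mathcal{R}\ (\mathcal{H})$ 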

¹ The notation $\rightharpoonup$ denotes a partial map. Offsets in heaps are in 
$\mathbb{N}$, whereas offsets stored in capabilities are in $\mathbb{Z}$. 
Operations check whether the offsets are in bounds, which requires offsets to be 
non-negative. This means valid offset values can be converted from $\mathbb{Z}$ 
to $\mathbb{N}$ without issues. 
The function $\mathrm{alloc}\ \mu\ n = \mathrm{Succ}\ (\mu', c)$ takes a heap 
$\mu$ and size $n$ as input and produces a fresh capability $c$ and the updated 
heap $\mu'$ as output. The bounds of $c$ are determined by $n$. In the case of 
compressed capabilities, a sufficiently large $n$ may result in the upper bound 
being larger than what was requested. The capability $c$ is also given the 
appropriate permissions and a valid tag bit. Like that of CompCert, alloc is 
designed to never fail, provided that the countable set $\mathcal{B}$ is 
infinite. 

The function $\mathrm{free}\ \mu\ c = \mathrm{Succ}\ (\mu', c')$ takes a heap 
$\mu$ and capability $c = ((b, i, m), t)$ as input. Upon success, the operation 
will return the updated heap $\mu'$, where we now have $b \mapsto \emptyset$. 
The returned capability $c'$ is $c$ with its tag bit invalidated. This conforms 
to the CHERI-C design stated in [41]. We note that $c$ should also be a valid 
capability, that is, at the very least, the tag bit should be set, and the offset 
should be within the capability bounds. The function free may fail if the block 
is invalid or already freed, even if the capability itself was valid. In such a 
case, free returns a logical error. 

The function $\mathrm{load}\ \mu\ c\ t = \mathrm{Succ}\ v$ takes a heap $\mu$, 
capability $c$, and type $t$ as input, where $t$ is the type the user wants to 
load. Upon success, the operation will return the value $v$ from the memory, 
where $v$ has the corresponding type $t$.² Before load attempts to access the 
block provided by $c$, it first checks that $c$ has sufficient permissions to 
load. We use the CHERI-MIPS SAIL implementation of the CL[C] instruction [40] 
for the capability checks, implementing the extra checks provided that 
$t = \mathrm{Cap}_\tau$. Once the capability checks are done, the operation 
attempts to access the blocks and the mappings, failing and returning the 
appropriate logical error if they do not exist. 

When accessing both the main memory and tagged memory, there are a 
number of cases to consider. When loading primitive values, it is important that 
the region about to be loaded consists only of Byte cells and not MCapF cells. 
Thus, before loading the values, we check whether the contiguous cells in memory 
are all of Byte type. If this is not the case, load will return Undef. For 
capability fragments, the cell in memory has to be an MCapF. Finally, for 
capabilities, not only do the contiguous cells have to be of MCapF type, but 
(1) they must have the same memory capability value, and (2) the fragment values 
must form the sequence $0, 1, \dots, |\mathrm{Cap}_\tau| - 1$. The idea is that 
even if the contiguous cells have the same memory capability values, they do not 
form a valid capability if the fragments are not stored in order. After all the 
checks, the tagged memory will be accessed, where the tag value is retrieved.³ 
The loaded memory capability and tag bit are then combined to form a tagged 
capability, which load returns. 

The function $\mathrm{store}\ \mu\ c\ v = \mathrm{Succ}\ \mu'$ takes a heap 
$\mu$, capability $c$, and value $v$. Upon success, the operation will return 
the updated heap $\mu'$. Like load, store performs the necessary capability 
checks based on CHERI-MIPS’ CS[C] instruction and attempts to access the blocks 
and mappings afterwards, returning the appropriate exception upon failure. For 
storing primitive values and capability fragment values, the main memory mapping 
will simply be updated to contain the values, and the associated tagged memories 
will be invalidated. For primitive values that are not bytes, the values will be 
converted into a sequence of bytes, where each byte in the list will be stored 
contiguously in memory. A capability fragment value will be stored in the cell 
as an MCapF type, where the tag value of the fragment will be stripped when 
storing in memory. Finally, for capability values, the value will be split into 
a tag bit and a list comprising $|\mathrm{Cap}_\tau|$ memory capability 
fragments, with the fragment values forming the sequence 
$0, 1, \dots, |\mathrm{Cap}_\tau| - 1$. The main memory will store the list of 
memory fragments contiguously, and the tagged memory will store the tag value 
in the corresponding capability-aligned tagged memory. 

² For capability fragments, the corresponding type may be either 
$\mathrm{U8}_\tau$ or $\mathrm{S8}_\tau$. 
³ The tagged memory does not need to be accessed if $c$ does not have a 
capability load permission. In such a case, the loaded capability will have an 
invalidated tag. 
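
The interplay between fragments and tags can be seen by replaying the two copies of Listing 1.1 on a toy model. The Python sketch below is a strong simplification and not the verified model: it works on one (content map, tag map) pair per block, omits all permission and bounds checks, fixes the capability size to 16 bytes, returns Undef for a malformed capability load, and uses a hypothetical memcpy that copies cell by cell and carries tags across only when source and destination are both capability-aligned.

```python
CAP_SIZE = 16  # assumed capability size in bytes

def new_block():
    return ({}, {})  # (content map, tag map)

def store_cap(block, off, mcap, tag):
    """Store a capability as CAP_SIZE ordered fragments plus one tag."""
    content, tags = block
    assert off % CAP_SIZE == 0, "sketch only handles aligned capability stores"
    for n in range(CAP_SIZE):
        content[off + n] = ("MCapF", mcap, n)
    tags[off] = tag

def store_cell(block, off, cell):
    """Store a single byte or fragment; the covering tag is invalidated."""
    content, tags = block
    content[off] = cell
    tags[off - off % CAP_SIZE] = False

def load_cap(block, off):
    """Rebuild a tagged capability, or return 'Undef' if the cells are not
    the in-order fragments of one memory capability."""
    content, tags = block
    cells = [content.get(off + n) for n in range(CAP_SIZE)]
    ok = all(c is not None and c[0] == "MCapF"
             and c[1] == cells[0][1] and c[2] == n
             for n, c in enumerate(cells))
    if not ok:
        return "Undef"
    return (cells[0][1], tags.get(off, False))

def memcpy(dst, dst_off, src, src_off, size):
    """Hypothetical memcpy: cell-wise copy, tags kept only if both aligned."""
    for k in range(size):
        store_cell(dst, dst_off + k, src[0].get(src_off + k))
    if dst_off % CAP_SIZE == 0 and src_off % CAP_SIZE == 0:
        for k in range(0, size - size % CAP_SIZE, CAP_SIZE):
            dst[1][dst_off + k] = src[1].get(src_off + k, False)

n_cap = ("block_n", 0, "metadata")      # abstract (b, i, m) triple
a, b, c = new_block(), new_block(), new_block()
store_cap(a, 0, n_cap, True)            # *a = n
memcpy(b, 1, a, 0, CAP_SIZE)            # misaligned copy into b
memcpy(c, 0, b, 1, CAP_SIZE)            # copy back into an aligned slot
print(load_cap(c, 0))                   # same capability value, tag is False
```

The value loaded from c matches the one stored in a, but its tag is False, so a subsequent dereference would be rejected with a tag violation. An aligned copy straight from a to a fresh block would keep the tag.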


3.5 Properties 


In the previous section, we have articulated a formal CHERI-C memory model, 
explaining how the heap is structured and how the operations are defined. It is 
essential that the formalisation we provided is correct and is also suitable for 
verification or other types of analyses. In this section, we first discuss the proper- 
ties of the memory. We then discuss the properties of the operations themselves, 
primarily concerned with correctness. 

When we observe the memory, it is important that we always work with a 
valid one, i.e. the memory is well-formed. In our formalisation, we require that 
all tags in the tagged memory are stored in a capability-aligned location. The 
well-formedness relation $\mathcal{W}^{C}_{f}$ is defined as follows: 


$$\mathcal{W}^{C}_{f}(\mu) \equiv \forall b \in \mathrm{dom}(\mu).\ b \mapsto (c, t) \longrightarrow \forall x \in \mathrm{dom}(t).\ x \bmod |\mathrm{Cap}_\tau| = 0$$


The well-formedness property must hold when the heap is initialised and 
when memory operations mutate the heap. That is, provided $\mu_0$ is the 
initialised heap where all mappings are empty, $\alpha \in \mathcal{A}_C$ is a 
memory action, $\bar{v}$ are the arguments of the memory operation $\alpha$, and 
$\mu'$ is one of the return values denoting the updated heap, we have the 
following properties: 


$$\mathcal{W}^{C}_{f}(\mu_0)$$


$$\mathcal{W}^{C}_{f}(\mu) \Longrightarrow \alpha\ \mu\ \bar{v} = \mathrm{Succ}\ \mu' \Longrightarrow \mathcal{W}^{C}_{f}(\mu')$$


The two properties above ensure that the heap is well-formed throughout the 
execution of the CHERI-C program. 
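
As an informal illustration of the invariant, the Python sketch below checks well-formedness on a dictionary-based heap and shows a tag update that preserves it. The heap encoding (None for freed blocks, a pair of dictionaries otherwise), the helper set_tag, and the 16-byte capability size are assumptions of this sketch, not part of the formalisation.

```python
CAP_SIZE = 16  # assumed capability size in bytes

def well_formed(heap):
    """Every tag of every live block sits at a capability-aligned offset."""
    return all(off % CAP_SIZE == 0
               for entry in heap.values() if entry is not None  # skip freed blocks
               for off in entry[1])

def set_tag(heap, b, off, tag):
    """Record a tag for the capability-sized slot that contains offset off."""
    _, tags = heap[b]
    tags[off - off % CAP_SIZE] = tag

heap = {0: ({}, {}), 1: None}   # one live block, one freed block
assert well_formed(heap)        # the initial heap is well-formed
set_tag(heap, 0, 21, False)     # lands on offset 16, so the invariant is kept
assert well_formed(heap)
heap[0][1][21] = True           # bypassing set_tag breaks the invariant
assert not well_formed(heap)
```

Routing every tag write through a helper that aligns the offset is one way to make preservation of the invariant easy to argue for each operation.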

For the correctness of the operations, we primarily consider soundness and 
completeness: 


— If the inputs are valid for operation $\alpha \in \mathcal{A}_C$, then the 
action should succeed. 

— If the action $\alpha$ succeeds, the inputs provided to the operation are 
valid. 

— If the inputs are invalid for the operation $\alpha$, then the action should 
fail and return the correct error. 
The first and second points are simple soundness and completeness properties. 
The third point is important in that the input may be problematic in many ways. 
For example, the NULL capability has an invalid tag bit, invalid bounds, and no 
permissions. The function load will fail if provided with the NULL capability, 
as it violates many of the checks. Because the SAIL specification states that tags 
are always checked first, the error must be of type TagViolation. 

Next, we need to ensure successive operations yield the desired result. The 
primary properties to consider are the good variable laws [26]; examples of prop- 
erties encoding this law include load after allocation, load after free, and load 
after store. It is worth mentioning there are some caveats. For example, the 
load after store case no longer guarantees that you will retrieve the same value 
you stored, unlike CompCert’s load after store property in [26], since the value 
that was stored and to be loaded again could have been either a capability 
or capability fragment. In such cases, the tag bit may become invalidated due 
to insufficient permissions on the capability, or because storing capability frag- 
ments resulted in the tagged memory being cleared. The solution is to divide 
the general property into a primitive value case and a capability-related value 
case. Ultimately, the idea is to prove that the loaded value is correct rather than 
exact, i.e. capability-related values, when loaded, will have the correct tag value. 

Finally, we have properties suitable for verification. We note that the memory 
$\mathcal{H}$ can be instantiated as a separation algebra by providing the 
partial commutative monoid (PCM) $(\mathcal{H}, \uplus, \mu_0)$, where $\uplus$ 
is the disjoint union of two heaps and $\mu_0$ is the empty initialised heap. 
For tools that rely on using partial memories, it is also imperative to show 
that the well-formedness property is compatible with memory composition: 


$$\mathcal{W}^{C}_{f}(\mu_1 \uplus \mu_2) \Longrightarrow \mathcal{W}^{C}_{f}(\mu_1) \wedge \mathcal{W}^{C}_{f}(\mu_2)$$


We also note that the current heap design keeps track of negative resources [28], 
which may potentially be useful for incorrectness logic based verification [33]. 
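
The compatibility property can be exercised on a toy encoding of heaps. In the Python sketch below, heaps are dictionaries from block identifiers to either None (freed) or a pair of dictionaries, and composition is defined only when the two heaps share no block identifier. Both the encoding and the block-level notion of disjointness are assumptions made for illustration.

```python
CAP_SIZE = 16  # assumed capability size in bytes

def well_formed(heap):
    """Every tag of every live block sits at a capability-aligned offset."""
    return all(off % CAP_SIZE == 0
               for entry in heap.values() if entry is not None
               for off in entry[1])

def disjoint_union(h1, h2):
    """Return the composed heap, or None when the heaps share a block."""
    if h1.keys() & h2.keys():
        return None
    return {**h1, **h2}

h1 = {0: ({}, {0: True})}
h2 = {1: None, 2: ({}, {32: False})}   # block 1 is a freed (negative) resource
whole = disjoint_union(h1, h2)

# Well-formedness of the composition implies well-formedness of both parts.
assert well_formed(whole)
assert well_formed(h1) and well_formed(h2)
assert disjoint_union(h1, h1) is None  # overlapping heaps do not compose
```

Since well-formedness is a universally quantified statement over blocks, it holds for any sub-heap of a well-formed heap, which is the intuition behind the property above.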


4 Application 


The overall memory model provided in Sect. 3 has been designed to be appli- 
cable for verification tools. In this section, we explain how we use the theory 
provided above to create a verified, executable instance of the memory model. 
We then explain how this executable model can be used to instantiate a tool 
called Gillian [20]. Using the instantiated tool, we demonstrate the concrete 
execution of CHERI-C programs with the desired behaviour. 


4.1 Isabelle/HOL 


Isabelle/HOL is an interactive theorem prover based on classical Higher Or- 
der Logic (HOL) [32]. We use Isabelle/HOL to formalise the entirety of the 
CHERI-C memory model discussed in Sect. 3. Types, values, heap structure, 
etc. were implemented, memory operations were defined, and properties relating 
to the heap and the operations were proven. Memory capabilities, tagged 
capabilities, and capability fragments were represented using records, a form of 
tuple with named fields. For code generation, we instantiated the block type 
$\mathcal{B}$ to be $\mathbb{Z}$. To show that $\mathcal{H}$ is an instance of a 
separation algebra, we use the cancellative_sep_algebra class [23] and prove 
that the heap model is an instance. This proof ultimately shows that 
$\mathcal{H}$ forms a PCM. The property that well-formedness is compatible with 
memory composition is stated slightly differently. The 
cancellative_sep_algebra class takes a total operator $+$ instead of a partial 
one and requires a ‘separation disjunction’ binary operator $\#$, which states 
disjointness. Ultimately, the compatibility property can be given as: 


$$\mu_1 \mathbin{\#} \mu_2 \Longrightarrow \mathcal{W}^{C}_{f}(\mu_1 + \mu_2) \Longrightarrow \mathcal{W}^{C}_{f}(\mu_1) \wedge \mathcal{W}^{C}_{f}(\mu_2)$$


For partial mappings of the form $A \rightharpoonup B$, we use Isabelle/HOL’s 
finite mapping type ('a, 'b) mapping [22]. To ensure we obtain an OCaml 
executable instance 
of the memory model, we use the Containers framework [27], which generates 
a Red-Black Tree mapping provided the abstract mapping in Isabelle/HOL. All 
definitions in Isabelle were either defined to be code-generatable to begin with 
(i.e. definitions should not comprise quantifiers or non-constructive constants 
like the Hilbert choice operation SOME), or code equations were provided and 
proven to ensure a sound code generation [21]. For bounded machine words, 
which are required for formalising the primitive values, we use Isabelle/HOL’s 
word type 'a word, where 'a states the length of the word [14]. Types like 
'a word, nat, int and string were also transformed to use OCaml’s Zarith 
and native string library for efficiency [21]. 


4.2 Gillian 


Gillian is a high-level analysis framework, theoretically capable of analysing a 
wide range of languages. The framework allows concrete and symbolic execu- 
tion, verification based on Separation Logic, and bi-abduction [28]. The crux of 
the framework lies in its parametricity, where the tool can be instantiated by 
simply providing a compiler front end and OCaml-based memory models of the 
language. So far, CompCert C and JavaScript have both been instantiated for 
Gillian, giving birth to Gillian-C and Gillian-JS. 

The underlying theoretical foundation of Gillian has its essential correctness 
properties like soundness and completeness already proven [20,29]. Thus, users 
who instantiate the tool only need to prove the correctness of the implementation 
of their compiler and memory models to ensure the correctness of the entire tool. 
From the perspective of someone trying to instantiate Gillian with their compiler 
and memory models, it is essential to understand the underlying intermediate 
language GIL and the overall memory model interface used by Gillian. 


GIL GIL is the GOTO-based Intermediate Language used by Gillian for
all types of analyses the tool supports. For concrete execution, GIL
supports basic GOTO constructs and assertions. For symbolic execution, the GIL
grammar is extended to support path cutting, i.e. assumptions, and generation
of symbolic variables. For separation logic based verification, the GIL grammar is
further extended to support core predicates and user-defined predicates [28] that
can be utilised to form separation logic based assertions. Furthermore, function
specifications in the Hoare-triple form $\{P\}\, f(\bar{x})\, \{Q\}$ can be provided, where P
and Q are separation logic based assertions.

Note that Gillian uses a value set V which differs from that used in the 
CHERI-C memory model. As we are only interested in the values used in the 
CHERI-C memory model, it is possible to implement a thin conversion layer 
between the two value systems. We note that a list of GIL values also constitutes 
a GIL value, so arguments for functions can be expressed as a single GIL value. 
This is important when understanding the memory model layout of Gillian. 


Memory Model Memory Models in Gillian have a specific definition and have 
properties that state what kind of analysis is supported. Proving that the pro- 
vided memory models satisfy certain properties is essential in understanding 
what the instantiated tool supports. 

Gillian differentiates between concrete and symbolic memory models, which 
are used for concrete and symbolic execution, respectively. As we are concerned 
with concrete execution, we will consider only concrete memory models here. 

At the highest level, there are two kinds of memory model properties: exe- 
cutional and compositional. The executional memory model states properties a 
memory model must have for whole-program execution, and the compositional 
memory model states properties a memory model must have for separation logic 
based symbolic verification. Each paper in the Gillian literature states slightly
different definitions for the memory models [20, 28, 29, 37]; in Definitions 1 and
2 below, we present unified, consistent definitions for each of the memory model
properties. We ignore contexts, as there exists only one context in concrete
memories, which is the GIL boolean value true.


Definition 1. (Execution Memory Model). Given the set of GIL values V and
an action set A, an execution memory model $M(V, A) = \langle |M|, \mathcal{W}, ea \rangle$
comprises:


1. a set of memories $|M| \ni \mu$
2. a well-formedness relation $\mathcal{W} \subseteq |M|$, with $\mathcal{W}(\mu)$ denoting that $\mu$ is well-formed
3. the action execution function $ea : A \to |M| \to V \to R(|M| \times V)$
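
To make the shape of Definition 1 concrete, the following Python sketch models a toy byte-addressed memory. It is an illustration only: the `Memory` representation, the `load` and `store` actions, and the `Ok`/`Err` result type are assumptions standing in for |M|, A, and R, and do not reflect Gillian's actual OCaml interface or the CHERI-C heap.

```python
from dataclasses import dataclass
from typing import Dict, Union

Value = Union[int, bool, str, tuple]   # stand-in for the set V of GIL values
Memory = Dict[int, int]                # hypothetical memory: address -> byte


@dataclass(frozen=True)
class Ok:
    """Successful outcome: an updated memory paired with a return value."""
    memory: Memory
    value: Value


@dataclass(frozen=True)
class Err:
    """Failing outcome: no updated memory is returned."""
    reason: str


Result = Union[Ok, Err]                # stand-in for R(|M| x V)


def well_formed(mu: Memory) -> bool:
    """W(mu): addresses are non-negative and every cell holds one byte."""
    return all(a >= 0 and 0 <= b <= 255 for a, b in mu.items())


def ea(action: str, mu: Memory, arg: Value) -> Result:
    """Action execution function ea : A -> |M| -> V -> R(|M| x V)."""
    if action == "load":
        if arg not in mu:
            return Err("load from unallocated address")
        return Ok(mu, mu[arg])
    if action == "store":
        addr, byte = arg
        if addr not in mu:
            return Err("store to unallocated address")
        return Ok({**mu, addr: byte}, ())   # the input memory is left untouched
    return Err(f"unknown action {action}")
```

In this sketch an erroneous action returns `Err` without a memory component, which mirrors the observation that operations returning errors do not return an updated heap.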


Definition 2. (Compositional Memory Model). Given the set of GIL values V
and core predicate set $\Gamma$, a compositional memory model $M(V, A_\Gamma) = \langle |M|, \mathcal{W}, ea_\Gamma \rangle$
comprises:


1. a partial commutative monoid (PCM) $(|M|, \cdot, 0)$
2. a well-formedness relation $\mathcal{W} \subseteq |M|$ with the following property:


$$\mathcal{W}(\mu_1 \cdot \mu_2) \Longrightarrow \mathcal{W}(\mu_1) \wedge \mathcal{W}(\mu_2)$$

3. the predicate action execution function $ea_\Gamma : A_\Gamma \to |M| \to V \to R(|M| \times V)$


First, we note that for concrete execution, Gillian also uses the return type
R in the action execution function ea.⁴ For the well-formedness relation defined in Definition 1, the main
properties that must be satisfied are Properties 3.1, 3.2, and 3.6 in [29].

The PCM is required to show that the heap forms a separation
algebra [16]. The well-formedness relation is extended to state that memory composition must also be
well-formed. Finally, the predicate action execution function provides a way to
frame on and off parts of the memory, though it is not required for concrete
execution, as predicate actions are not part of the GIL concrete execution grammar.
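
The PCM structure and the composition-compatibility of well-formedness can be sketched in Python as follows. This is a minimal illustration under stated assumptions: heap fragments are plain dictionaries, composition is disjoint union, and `well_formed` is a hypothetical per-cell check; the CHERI-C heap and its well-formedness relation are considerably richer.

```python
from typing import Dict, Optional

Heap = Dict[int, int]          # hypothetical heap fragment: address -> byte
EMPTY: Heap = {}               # the unit of the monoid


def disjoint(h1: Heap, h2: Heap) -> bool:
    """Separation disjointness: the two fragments share no address."""
    return h1.keys().isdisjoint(h2.keys())


def compose(h1: Heap, h2: Heap) -> Optional[Heap]:
    """Partial composition: defined only when the domains are disjoint."""
    if not disjoint(h1, h2):
        return None
    return {**h1, **h2}


def well_formed(h: Heap) -> bool:
    """A per-cell check, so it is preserved when a heap is split."""
    return all(a >= 0 and 0 <= b <= 255 for a, b in h.items())


def compatible(h1: Heap, h2: Heap) -> bool:
    """If the composite is defined and well-formed, so are both parts."""
    h = compose(h1, h2)
    if h is None or not well_formed(h):
        return True            # the implication holds vacuously
    return well_formed(h1) and well_formed(h2)
```

Because the check is cell-wise, any violation in a fragment also appears in the composite, which is why the implication in Definition 2 holds for this toy model.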

Using the CHERI-C memory model we defined earlier, we can show that
our model conforms to both Definitions 1 and 2. Let $A_C$ be the set of memory
actions, H be the memory, $ea_C$ be the action execution function of the CHERI-C
memory model, and $\mathcal{W}^{E}_{C}$ be the well-formedness relation. Then we observe
that $\langle H, \mathcal{W}^{E}_{C}, ea_C \rangle$ forms an execution memory model. We note that Properties
3.1 and 3.2 in [29] are satisfied, and Property 3.6 holds trivially, in that operations
that return errors do not return an updated heap. We also note that the memory
model conforms to a compositional memory model, as we have the
PCM $(H, \cdot, \mu_0)$ along with the well-formedness property being composition-compatible.
The predicate action execution function is not required to be given,
as the concrete execution of Gillian does not utilise this feature.


4.3 Compiler 


We implemented a CHERI-C to GIL compiler by utilising ESBMC’s GOTO 
language. The idea is that ESBMC uses its own intermediate representation 
for bounded model checking, which is the GOTO language. CHERI-enabled 
ESBMC uses Clang as a front end to generate the GOTO language. In our case 
we can build a GOTO to GIL compiler instead of building a CHERI-C compiler 
from scratch. The GOTO language is very similar to GIL in that they are both
goto-based languages and use static single assignment. For the most part, the
compilation process is straightforward. ESBMC's GOTO language is typed,
while the CHERI-C memory model is untyped, in the sense that the
memory model does not support user-defined types like structs; we therefore make sure
that capability arithmetic and casts are applied correctly by inferring the sizes
of the user-defined types.


5 Experimental Results 


In Sect. 4, we have provided a way to instantiate the Gillian tool, where we
obtain a concrete CHERI-C model using Isabelle/HOL and a CHERI-C to GIL
compiler that utilises ESBMC's GOTO language. Our framework can demonstrate
that higher-level memory actions (such as memcpy(), which preserves
tags when applicable) can be implemented. Furthermore, we can run concrete
instances of programs that use memcpy() to show they exhibit the expected
behaviour. This also means the tool can catch the TagViolation exception that
is triggered in Listing 1.1. Our tool also allows capability-related functions
defined in cheriintrin.h and cheri.h to be used, i.e. it is possible to call
operations such as cheri_tag_get() and cheri_tag_clear().


⁴ In the Gillian literature, it is stated that R can return both a return value and
an error. The OCaml implementation of Gillian slightly differs from this and is more
similar to R used for the CHERI-C memory model.


Filename             GC    GCC   AM    BMC
buffer_overflow.c
dangling_ptr.c       ✓     ✓     ✗     ✓
double_free.c        ✓     ✓     ✗     ✓
invalid_free.c       ✗     ✓     ✓     ✓
misaligned_ptr.c     ✓     ✓     ✓     ✗
listing1.c           ✗     ✓     ✓     ✗

Table 1: Violation detection performance


Filename           Time (s)
libc_malloc.c      8.585
libc_memcpy.c      1.698
libc_memmove.c     0.318
libc_string.c      0.315

Table 2: GCC runtime


Table 1 lists a set of safety violations and shows which of Gillian-C, our tool, the ARM
Morello hardware, and CHERI-ESBMC (labelled GC, GCC, AM, and BMC,
respectively) catch them. We observe that Morello fails to catch temporal safety
violations such as dangling pointers and double frees. For the invalid free case,
where we attempt to free a pointer not produced by malloc, we discovered a
bug in the Gillian-C tool that causes it to miss this violation.⁵ Gillian-C does not
return any errors for the program in Listing 1.1, which is to be expected, as this
is not problematic for conventional C. Finally, we observe that CHERI-ESBMC
fails to catch the last two violations, which relate to tag invalidation.

Table 2 shows the runtime performance of running the CHERI-C library test
suites, based on the Clang CHERI-C test suite [1]. Tests were conducted on
a machine running Fedora 34 on an 11th Gen Intel Core i7-1185G7 CPU with
31.1 GB RAM, with trace logging enabled. We note that when the test cases
were executed on Morello without any modifications to the code, all of the tests
terminated instantaneously without any issues. In the libc_malloc.c test
case, we reduced the scope of the test⁶ to ensure the tool terminates within a
reasonable time, though the performance can be drastically improved by turning
logging off, e.g. the libc_malloc.c case would then take only 0.686 seconds. For
the remaining tests, we made modifications to the code to ensure the compiler
can correctly produce the GIL code, and we made sure to preserve all the edge
cases covered by the original tests. For example, in libc_memcpy.c we made
sure to test all cases where both src and dst capabilities were aligned and
misaligned at the beginning and the end, which affects tag preservation. We
observed that no assertions were violated, and we also observed that the same
code when run on Morello also resulted in no assertion violations, demonstrating
a faithful implementation of CHERI-C semantics.


⁵ The bug has since been fixed after a discussion with the developers [7].
⁶ In particular, we reduced max in the libc_malloc.c case in [1] from 20 to 9.
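
The interaction between alignment and tag preservation that the libc_memcpy.c tests exercise can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: a flat byte array, a capability size of 16 bytes (`CAP_SIZE`), one tag bit per aligned slot, non-overlapping regions, and a simplified policy in which a tag survives only when a whole capability is copied between identically aligned slots. It is not the formal CHERI-C memory model.

```python
CAP_SIZE = 16  # assumed capability size in bytes (architecture dependent)


class TaggedMemory:
    """Flat memory with one validity tag per capability-aligned slot."""

    def __init__(self, size: int) -> None:
        self.data = bytearray(size)
        self.tags = [False] * (size // CAP_SIZE)

    def store_cap(self, addr: int, payload: bytes) -> None:
        """Store a capability-sized payload at an aligned address and tag it."""
        assert addr % CAP_SIZE == 0 and len(payload) == CAP_SIZE
        self.data[addr:addr + CAP_SIZE] = payload
        self.tags[addr // CAP_SIZE] = True

    def memcpy(self, dst: int, src: int, n: int) -> None:
        """Copy n bytes; regions are assumed not to overlap."""
        src_bytes = bytes(self.data[src:src + n])
        src_tags = list(self.tags)
        # Every destination slot touched by the copy first loses its tag.
        if n > 0:
            for slot in range(dst // CAP_SIZE, (dst + n - 1) // CAP_SIZE + 1):
                self.tags[slot] = False
        self.data[dst:dst + n] = src_bytes
        # Tags are restored only for whole capabilities copied between
        # slots that have the same alignment.
        if (dst - src) % CAP_SIZE == 0:
            first = -(-src // CAP_SIZE) * CAP_SIZE  # round src up to alignment
            for s in range(first, src + n - CAP_SIZE + 1, CAP_SIZE):
                d = s + (dst - src)
                self.tags[d // CAP_SIZE] = src_tags[s // CAP_SIZE]
```

With this policy, copying a tagged capability to an aligned destination keeps the tag, whereas a misaligned or partial copy moves the bytes but leaves the destination untagged.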


6 Related Work 


The CompCert C memory model [26], the CH2O memory model [24], and Tuch's C
memory model [38] are C memory models formalised in a theorem prover, each
focusing on different aspects of verification. Our model mostly draws inspiration
from these models, extending such work to support CHERI-C programs.

VCC, which internally uses the typed C memory model [19], and CHERI- 
ESBMC [15] are designed with automated verification of C programs via sym- 
bolic execution in mind—in particular, CHERI-ESBMC supports hybrid settings 
and compressed capabilities in addition to purecap settings and uncompressed 
capabilities. Both tools rely on a memory model that is not formally verified, so 
the tools have components that must be trusted. 


7 Conclusion and Future Work 


We have provided a formal CHERI-C memory model and demonstrated its utility
for verification. We formalised the entire theory in Isabelle/HOL and generated
an executable instance of the memory model, which was then used to instantiate
a CHERI-C tool. The result is a concrete execution tool that is robust
in terms of the properties that are guaranteed both by the tool and by the
memory model. We demonstrated its practicality by running CHERI-C based
test suites, capturing memory safety violations, and comparing the results with
actual CHERI hardware, namely the physical Morello processor.

Currently, the memory model has a number of limitations.
Capability arithmetic is limited to addition and subtraction, but the heap
can be extended to incorporate mappings from blocks to physical addresses and
vice versa, which provides a way to extend capability arithmetic. While the theory
incorporates abstract capabilities, support for compression is still in progress. We believe,
however, that the abstract design itself does not need to change. It may be
possible to utilise the compression/decompression work to convert between the
two forms [2] when needed whilst retaining our design for the operations.

This theory serves as a starting point for much potential future work. A 
compositional symbolic memory model can be built from this design to enable 
symbolic execution and verification in Gillian. As we have already proven the 
core properties, proving the remaining properties for the extended model will 
allow automated separation logic based verification of CHERI-C programs.

Abstract. The correctness of real-time systems depends both on the
correct functionalities and the real-time constraints. To go beyond the
existing Timed Automata based techniques, we propose a novel solution
that integrates a modular Hoare-style forward verifier with a term rewriting
system (TRS) on Timed Effects (TimEffs). The main purposes are
to: increase the expressiveness, dynamically manipulate clocks, and efficiently
solve clock constraints. First, we formally define a core language C^t,
generalizing the real-time systems, modeled using mutable variables and
timed behavioral patterns, such as delay, timeout, interrupt, deadline.
Secondly, to capture real-time specifications, we introduce TimEffs, a
new effects logic that extends regular expressions with dependent values
and arithmetic constraints. Thirdly, the forward verifier reasons about temporal
behaviors, expressed in TimEffs, of target C^t programs. Lastly, we
present a purely algebraic TRS, i.e., an extended Antimirov algorithm,
to efficiently check language inclusions between TimEffs. To demonstrate
the feasibility of our proposal, we prototype the verification system, prove
its soundness, and report on case studies and experimental results.


1 Introduction 


During the last three decades, a popular approach for specifying real-time systems 
has been based on Timed Automata (TAs) [1]. TAs are powerful in designing 
real-time models via explicit clocks, where real-time constraints are captured by 
explicitly setting/resetting clock variables. A number of automatic verification 
tools for TAs have proven to be successful [2,3,4,5]. Industrial case studies show 
that requirements for real-time systems are often structured into phases, which 
are then composed sequentially, in parallel, alternatively [6,7]. TAs lack high- 
level compositional patterns for hierarchical design; moreover, users often need to 
manipulate clock variables with carefully calculated clock constraints manually. 
The process is tedious and error-prone. 

There have been some translation-based approaches on building verification
support for compositional timed-process representations. For example, Timed
Communicating Sequential Process (TCSP), Timed Communicating Object-Z
(TCOZ) and Statechart based hierarchical Timed Automata are well suited for
presenting compositional models of complex real-time systems. Prior works [8,9]
systematically translate TCSP/TCOZ/Statechart models to flat TAs so that the
model checker Uppaal [3] can be applied. However, possible insufficiencies are:
the expressive power is limited by the finite-state automata; and there is
always a gap between the verified logic and the actual code implementation.

In this work, we investigate an alternative approach for verifying real-time systems.
We propose a novel temporal specification language, Timed Effects (TimEffs),
which enables a compositional verification via a Hoare-style forward verifier
and a term rewriting system (TRS). More specifically, we specify system behaviors
in the form of TimEffs, which integrates the Kleene Algebra with dependent
values and arithmetic constraints, to provide real-time abstractions into traditional
linear temporal logics. For example, one safety property, "The event Done
will be triggered no later than one time unit"¹, is expressed in TimEffs as:
$\Phi \triangleq 0 \le t \le 1 \wedge (\_^{\star} \cdot \mathit{Done}) \# t$. Here $\wedge$ connects the arithmetic formula and the timed
trace; the operator # binds time variables to traces (here t is a time bound of
$\_^{\star} \cdot \mathit{Done}$); _ is a wildcard matching any event; the Kleene star $\star$ denotes a trace
repetition. The above formula $\Phi$ corresponds to '$\Diamond_{[0,1]}\,\mathit{Done}$' in metric temporal
logic (MTL), which reads "within one time unit, Done finally happens". Furthermore,
the time bounds can be dependent on the program inputs, as shown in Fig. 1.
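
One way to read the example property is as a check on concrete timed traces. The Python sketch below does this under assumptions that are not taken from the paper: a trace is a list of (event, delay) pairs, where the delay is the time elapsed before the event, and the bound t is the sum of all delays. The function name and the trace encoding are hypothetical.

```python
from typing import List, Tuple

TimedTrace = List[Tuple[str, int]]   # (event, time units elapsed before it)


def satisfies_done_within(trace: TimedTrace, bound: int = 1) -> bool:
    """Check 0 <= t <= bound for a trace of the shape (_* . Done) # t."""
    if not trace or trace[-1][0] != "Done":
        return False                 # the trace must end with Done
    total = sum(delay for _, delay in trace)
    return 0 <= total <= bound
```

For instance, the trace `[("A", 0), ("Done", 1)]` is accepted, whereas `[("A", 1), ("Done", 1)]` takes two time units and is rejected.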

Function addNSugar takes a parameter n, representing the portion of the sugar to
add. When n=0, it raises an event EndSugar to mark the end of the process. Otherwise,
it adds one portion of the sugar by calling addOneSugar(), then recursively calls
addNSugar with parameter n-1. The use of timeout(e, d) is standard [11], which executes
a block of code e after the specified time d. Therefore, the time spent on adding
one portion of the sugar is at least one time unit. Note that $\epsilon \# t$ refers to an empty
trace which takes time t. Both preconditions require no arithmetic constraints and
no temporal constraints upon the history traces. The postcondition of addNSugar(n)
indicates that the method generates a finite trace where EndSugar takes a delay of
no less than n time units to finish.


 1  void addOneSugar ()
 2  /* req: true ∧ _*
 3     ens: t≥1 ∧ ε # t */
 4  { timeout ((), 1); }
 5
 6  void addNSugar (int n)
 7  /* req: true ∧ _*
 8     ens: t≥n ∧ EndSugar # t */
 9  { if (n == 0) {
10      event ["EndSugar"];}
11    else {
12      addOneSugar ();
13      addNSugar (n-1);}}

Fig. 1. Value-dependent specification.


Although these examples are simple, they show the benefits of deploying
value-dependent time bounds, which is beyond the capability of TAs. Essentially,
TimEffs define symbolic TAs, which stand for a (possibly infinite) set of
concrete transition systems. Moreover, we deploy a Hoare-style forward verifier
to soundly reason about the behaviors from the source level, with respect to
the well-defined operational semantics. This approach provides a direct (as opposed
to techniques which require manual and remote modeling processes) and
modular (modules can be replaced by their already verified properties)
verification for real-time systems, which is not possible with any existing
techniques. Furthermore, we develop a novel TRS, which is inspired by Antimirov
and Mosses' algorithm² [12] but solves the language inclusions between more
expressive TimEffs. In short, the main contributions of this work are:


¹ In this paper, we assume time is discrete and takes only integral values. However, it is just
as easy to represent continuous time by letting time variables assume real values [10].


1. Language Abstraction: we formally define a core language C^t, by defining
its syntax and operational semantics, generalizing the real-time systems with
mutable variables and timed behavioral patterns, e.g., delay, timeout, deadline.
2. Novel Specification: we propose TimEffs, by defining its syntax and semantics,
gaining the expressive power beyond traditional linear temporal logics.
3. Forward Verifier: we establish a sound effect system to reason about temporal
behaviors of given programs. The verifier triggers the back-end solver TRS.
4. Efficient TRS: we present the rewriting rules to (dis)prove the inclusion relations
between the actual behaviors and the given specifications, both in TimEffs.
5. Implementation and Evaluation: we prototype the automated verification
system, prove its soundness, and report on case studies and experimental results.


2 Overview 


An overview of our automated verification system is given in Fig. 2. The
system consists of a forward verifier and a TRS, i.e., the rounded boxes.
The input of the forward verifier is a C^t program annotated with temporal
specifications written in TimEffs. The input of the TRS is a pair of effects
LHS and RHS, referring to the inclusion LHS ⊑ RHS³ to be checked
(LHS and RHS refer to left/right-hand-side effects respectively). The forward
verifier calls the TRS to solve proof obligations. Next, we use Fig. 3, which
simulates a coffee machine that dynamically adds sugar based on the user's
input number, to highlight our main methodologies.


[Figure: a C^t program with TimEffs specifications is the input of the Forward Verifier (output: "The program is verified?"); two TimEffs, LHS ⊑ RHS, are the input of Effects Inclusion Proving via a TRS (output: "The inclusion is valid?").]

Fig. 2. System Overview.


2.1 TimEffs. We define Hoare-triple style specifications (enclosed in /*...*/)
for each function, which leads to a compositional verification strategy, where
static checking can be done locally. The precondition of makeCoffee specifies that
the input value n is non-negative, and it requires that, before entering
this function, the history trace must contain the event CupReady at the tail. The
verification fails if the precondition is not satisfied at the caller sites. Line 17
sets a five time-units deadline (i.e., maximum 5 portions of sugar per coffee) while
calling addNSugar (defined in Fig. 1); then emits event Coffee with a deadline,
indicating that the coffee pouring process takes no more than four time-units. The
precondition of main requires no arithmetic constraints (expressed as true) and
an empty history trace. The postcondition of main specifies that before the final
Done happens, there is no occurrence of Done (! indicates the absence of events);
and the whole process takes no more than nine time-units to hit the final event.


² Antimirov and Mosses' algorithm was designed for deciding the inequalities of regular
expressions based on an axiomatic algorithm of the algebra of regular sets.
³ The TimEffs inclusion relation ⊑ is formally defined in Definition 3.


TimEffs support more features such as disjunctions, guards, parallelism and
assertions, etc. (cf. Sec. 3.3), providing detailed information upon: branching
properties, where different arithmetic conditions on the inputs lead to different
effects; and required history traces, by defining the prior effects in the
precondition. These capabilities are beyond traditional timed verification, and
cannot be fully captured by any prior works [8,9,2,3,4,5]. Nevertheless, the
increase in expressive power needs support from finer-grained reasoning and a more
sophisticated back-end solver, discharged by our forward verifier and TRS.


14  void makeCoffee (int n)
15  /* req: n≥0 ∧ _* · CupReady
16     ens: n≤t≤5 ∧ t'≤4 ∧
           (EndSugar # t) · (Coffee # t') */
17  { deadline (addNSugar(n), 5);
18    deadline (event["Coffee"], 4);}
19
20  int main ()
21  /* req: true ∧ ε
22     ens: t≤9 ∧ ((!Done)* # t) · Done */
23  { event ["CupReady"];
24    makeCoffee (3);
25    event ["Done"];}

Fig. 3. To make coffee with three portions of sugar within nine time units.


 1. void addOneSugar(){   // initialize the state using the function precondition.
      Φ_C = Φ_pre^{addOneSugar()} = {true ∧ _*}                              [FV-Meth]
 2.   timeout ((), 1);}
      Φ_C' = {t1≥1 ∧ _* · (ε # t1)}                                          [FV-Timeout]
 3.   Φ_C' ⊑ Φ_pre^{addOneSugar()} · Φ_post^{addOneSugar()}  ⟺               // TRS: postcondition checked.
      t1≥1 ∧ (ε # t1) ⊑ t≥1 ∧ (ε # t)

 4. void addNSugar (int n){   // initialize the state using the function precondition.
      Φ_C = Φ_pre^{addNSugar(n)} = {true ∧ _*}                               [FV-Meth]
 5.   if (n == 0){
      {n=0 ∧ _*}                                                             [FV-Cond]
 6.   event ["EndSugar"];}
      {n=0 ∧ _* · EndSugar}                                                  [FV-Event]
 7.   else {
      {n≠0 ∧ _*}                                                             [FV-Cond]
 8.   addOneSugar ();
      {n≠0 ∧ t2≥1 ∧ _* · (ε # t2)}                                           [FV-Call]
 9.   addNSugar (n-1);}}
      n≠0 ∧ t2≥1 ∧ _* · (ε # t2) ⊑ Φ_pre^{addNSugar(n-1)}                    // TRS: precondition checked.
      {n≠0 ∧ t2≥1 ∧ _* · (ε # t2) · Φ_post^{addNSugar(n-1)}}                 [FV-Call]
10.   Φ_C' = (n=0 ∧ _* · EndSugar) ∨
             (n≠0 ∧ t2≥1 ∧ _* · (ε # t2) · Φ_post^{addNSugar(n-1)})          [FV-Cond]
11.   Φ_C' ⊑ Φ_pre^{addNSugar(n)} · Φ_post^{addNSugar(n)}  ⟺                 // TRS: postcondition checked, cf. Table 1
      (n=0 ∧ EndSugar) ∨ (n≠0 ∧ t2≥1 ∧ (ε # t2) · Φ_post^{addNSugar(n-1)}) ⊑ Φ_post^{addNSugar(n)}


Fig. 4. The forward verification examples (t1 and t2 are fresh time variables).


2.2 Forward Verification. Fig. 4 demonstrates the forward verification of
functions addOneSugar and addNSugar, defined in Fig. 1. The effects states are
captured in the form of {Φ_C}. To facilitate the illustration, we label the steps
by (1) to (11), and mark the deployed forward rules (cf. Sec. 4.1) in [gray]. The
initial states (1) and (4) are obtained from the preconditions, by the [FV-Meth]
rule. States (5), (7) and (10) are obtained by [FV-Cond], which adds the conditional
constraints to the effects states, and takes the union of the effects accumulated
from the two branches. State (6) is obtained by [FV-Event], which concatenates
an event to the current effects. The intermediate states (8) and (9) are obtained
by [FV-Call]. Before each function call, [FV-Call] invokes the TRS to check
whether the current effects states satisfy the callee's precondition. If it is not
satisfied, the verification fails; otherwise, it concatenates the callee's postcondition to
the current states (the precondition check for step (8) is omitted here).

State (2) is obtained by [FV-Timeout], which adds a lower time-bound to an
empty trace. After these state transformations, steps (3) and (11) invoke the TRS
to check the inclusions between the final effects and the declared postconditions.
2.3 The TRS. Having TimEffs as the specification language, and the
forward verifier to reason about the actual behaviors, we are interested in the
following verification problem: Given a program P and a temporal specification
$\Phi'$, does the inclusion $\Phi^{P} \sqsubseteq \Phi'$ hold? Typically, checking the inclusion/entailment
between the concrete program effects $\Phi^{P}$ and the expected property $\Phi'$
proves that the program P will never lead to unsafe traces which violate $\Phi'$.

Our TRS is an extension of Antimirov and Mosses’s algorithm [12], which 
can be deployed to decide inclusions of two regular expressions (REs) through an 
iterated process of checking inclusions of their partial derivatives [13]. There are 
two basic rules: [Disprove] infers false from trivially inconsistent inclusions; and 
[Unfold] applies Definition 2 to generate new inclusions. 


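Definition 1 (Derivative). Given any formal language S over an alphabet $\Sigma$
and any string $u \in \Sigma^{*}$, the derivative of S with respect to u is defined as:
$u^{-1}S = \{ w \in \Sigma^{*} \mid uw \in S \}$.


Definition 2 (REs Inclusion). For REs r and s, $r \preceq s \Leftrightarrow \forall (A \in \Sigma).\ A^{-1}(r) \preceq A^{-1}(s)$.


Definition 3 (TimEffs Inclusion). For TimEffs $\Phi_1$ and $\Phi_2$,
$\Phi_1 \sqsubseteq \Phi_2 \Leftrightarrow \forall A.\ \forall t \ge 0.\ (A \# t)^{-1}\Phi_1 \sqsubseteq (A \# t)^{-1}\Phi_2$.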


Similarly, we define Definition 3 for unfolding the inclusions between TimEffs,
where $(A \# t)^{-1}\Phi$ is the partial derivative of $\Phi$ w.r.t. the event A with the time
bound t. Termination of the rewriting is guaranteed because the set of derivatives
to be considered is finite, and possible cycles are detected using memoization (cf.
Table 5) [14]. Next, we use Table 1 to demonstrate how the TRS automatically
proves that the final effects of addNSugar satisfy its postcondition (shown at step (11) in
Fig. 4). We mark the rewriting rules (cf. Sec. 5) in [gray].
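
The unfold-and-memoize idea behind Definition 2 can be sketched for plain regular expressions in Python. Note the assumptions: this sketch uses Brzozowski derivatives with alternation normalised as a set, rather than Antimirov's partial derivatives; it has no time bounds or arithmetic constraints; and the tuple encoding and all function names are hypothetical. It is not the TRS of the paper.

```python
EMPTY, EPS = ("empty",), ("eps",)          # the empty language and the empty word


def sym(a):
    return ("sym", a)


def alt(*rs):
    """Alternation kept as a set, so it is associative, commutative, idempotent."""
    parts = set()
    for r in rs:
        if r[0] == "alt":
            parts |= r[1]
        elif r != EMPTY:
            parts.add(r)
    if not parts:
        return EMPTY
    return next(iter(parts)) if len(parts) == 1 else ("alt", frozenset(parts))


def cat(r, s):
    if r == EMPTY or s == EMPTY:
        return EMPTY
    if r == EPS:
        return s
    return r if s == EPS else ("cat", r, s)


def star(r):
    if r in (EMPTY, EPS):
        return EPS
    return r if r[0] == "star" else ("star", r)


def nullable(r):
    """Does the language of r contain the empty word?"""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "sym"):
        return False
    if tag == "alt":
        return any(nullable(x) for x in r[1])
    return nullable(r[1]) and nullable(r[2])


def deriv(a, r):
    """Derivative of r with respect to the single event a."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "sym":
        return EPS if r[1] == a else EMPTY
    if tag == "alt":
        return alt(*(deriv(a, x) for x in r[1]))
    if tag == "star":
        return cat(deriv(a, r[1]), r)
    head = cat(deriv(a, r[1]), r[2])
    return alt(head, deriv(a, r[2])) if nullable(r[1]) else head


def included(r, s, alphabet, seen=None):
    """Decide whether the language of r is included in that of s."""
    seen = set() if seen is None else seen
    if (r, s) in seen:
        return True                         # cycle detected: reuse the hypothesis
    if nullable(r) and not nullable(s):
        return False                        # plays the role of [Disprove]
    seen.add((r, s))
    return all(included(deriv(a, r), deriv(a, s), alphabet, seen)
               for a in alphabet)           # plays the role of [Unfold]
```

For example, with `a, b = sym("a"), sym("b")`, the call `included(star(cat(a, b)), star(alt(a, b)), "ab")` succeeds, while the converse inclusion is refuted after one unfolding.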

In Table 1, step ① renames the time variables to avoid name clashes
between the antecedent and the consequent. Step ② splits the proof tree into
two branches, according to the different arithmetic constraints, by rule [LHS-OR].
In the first branch, step ③ eliminates the event ES from the head of both sides,
by rule [UNFOLD]. Step ④ proves the inclusion, because evidently the consequent
tR≥0 ∧ ε#tR contains ε when tR=0. In the second branch, step ⑤ eliminates a
time duration ε#t2 from both sides. Therefore the rule [UNFOLD] subtracts a time
duration from the consequent, i.e., (tR-t2). Similarly, step ⑥ eliminates ES#tL
from both sides, adding tL=(tR-t2) to the unification constraints. Step ⑦
proves t2≥1 ∧ tL≥(n-1) ∧ tL=(tR-t2) ⇒ tR≥n⁴; therefore, the proof succeeds.


Table 1. An inclusion proving example. (I) is the right-hand-side sub-tree of the
main rewriting proof tree. (ES stands for the event EndSugar.)

                                                                  ④ [PROVE]
n=0 ∧ ε ⊑ tR≥0 ∧ ε#tR                                             ③ [UNFOLD]
n=0 ∧ ES ⊑ tR≥0 ∧ ES#tR

(n=0 ∧ ES) ∨ (n≠0 ∧ t2≥1 ∧ (ε # t2) · Φ_post^{addNSugar(n-1)}) ⊑ Φ_post^{addNSugar(n)}

(I)
t2≥1 ∧ tL≥(n-1) ∧ tL=(tR-t2) ⇒ tR≥n                               ⑦ [PROVE]
n≠0 ∧ t2≥1 ∧ tL≥(n-1) ∧ ε ⊑ tR≥n ∧ ε                              ⑥ [UNFOLD]  tL=(tR-t2)
n≠0 ∧ t2≥1 ∧ tL≥(n-1) ∧ ES#tL ⊑ tR≥n ∧ ES#(tR-t2)                 ⑤ [UNFOLD]
n≠0 ∧ t2≥1 ∧ tL≥(n-1) ∧ ε#t2 · ES#tL ⊑ tR≥n ∧ ES#tR
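
The arithmetic obligation in the final step can also be checked by hand. The short derivation below uses only the three constraints collected in the second branch, writing the time variables with subscripts:

```latex
\begin{align*}
t_R &= t_L + t_2      && \text{from } t_L = t_R - t_2 \\
    &\ge (n - 1) + 1  && \text{from } t_L \ge n - 1 \text{ and } t_2 \ge 1 \\
    &= n
\end{align*}
```

Hence $t_R \ge n$ holds, which is the consequent the solver has to establish.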
2.4 Verifying Fischer's Mutual Exclusion Protocol. Fig. 5 presents
the classical Fischer's mutual exclusion protocol in C^t.
Global variables x and cs indicate 'which process attempted
to access the critical section most recently' and 'the number
of processes accessing the critical section' respectively. The
main procedure is a parallel composition of three processes,
where d and e are two constants. Each process attempts
to enter the critical section when x is -1, i.e. no other process
is currently attempting. Once the process is active (i.e.,
reaches line 6), it sets x to its identity number i within d
time units, captured by deadline(...,d). Then it idles for e time units, captured
by delay(e), and then checks whether x still equals i. If so, it safely enters the
critical section. Otherwise, it restarts from the beginning. The quantitative timing
constraint d<e plays an important role in this algorithm to guarantee mutual
exclusion. One way to prove mutual exclusion is to show that cs≤1 is always true.
Or, using event temporal logic, we can show that the occurrence of Critical
always indicates the next event is Exit. We show in Sec. 6 that our prototype
system can verify such algorithms symbolically.


 1  var x := -1;
 2  var cs := 0;
 3
 4  void proc (int i) {
 5    [x=-1] //block waiting until true
 6    deadline (event ["Update"(i)]{x:=i}, d);
 7    delay (e);
 8    if (x==i) {
 9      event ["Critical"(i)]{cs:=cs+1};
10      event ["Exit"(i)]{cs:=cs-1; x:=-1};
11      proc (i);
12    } else {proc (i);}}
13
14  void main ()
15  /* req: d<e ∧ ε    ens_a: true ∧ (cs≤1)*
16     ens_b: true ∧ ((_*) · Critical · Exit · (_*))* */
17  { proc(0) || proc(1) || proc(2); }

Fig. 5. Fischer's mutual exclusion algorithm.
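
The role of the constraint d<e can be explored with a small explicit-state search in Python. This is a sketch under assumptions that are not part of the paper: time is discrete, the model is restricted to two processes by default, the counter cs is replaced by counting processes in the critical section, and the function name and state encoding are hypothetical. It enumerates concrete states, unlike the symbolic verification of the prototype.

```python
from collections import deque


def mutual_exclusion_holds(d: int, e: int, n_procs: int = 2) -> bool:
    """Search all reachable states of a discrete-time Fischer model."""
    init = (-1, tuple(("idle", 0) for _ in range(n_procs)))
    seen, queue = {init}, deque([init])
    while queue:
        x, procs = queue.popleft()
        if sum(pc == "crit" for pc, _ in procs) > 1:
            return False                     # two processes are critical at once
        successors = []
        for i, (pc, clk) in enumerate(procs):
            def update(new_x, local):
                return (new_x, procs[:i] + (local,) + procs[i + 1:])
            if pc == "idle" and x == -1:     # the guard [x=-1] passes
                successors.append(update(x, ("set", 0)))
            elif pc == "set":                # x := i, done within the deadline d
                successors.append(update(i, ("wait", 0)))
            elif pc == "wait" and clk == e:  # delay(e) is over: test x == i
                successors.append(update(x, ("crit" if x == i else "idle", 0)))
            elif pc == "crit":               # leave the critical section
                successors.append(update(-1, ("idle", 0)))
        # A time unit may pass unless a deadline or a delay would be overrun.
        can_tick = all(not (pc == "set" and clk >= d)
                       and not (pc == "wait" and clk >= e)
                       for pc, clk in procs)
        if can_tick:
            ticked = tuple((pc, clk + 1) if pc in ("set", "wait") else (pc, clk)
                           for pc, clk in procs)
            successors.append((x, ticked))
        for state in successors:
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return True
```

In this model `mutual_exclusion_holds(1, 2)` reports that no reachable state has two processes in the critical section, whereas `mutual_exclusion_holds(2, 1)` finds a violation, since a slow writer can overwrite x after the other process has already passed its check.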


⁴ The proof obligations for arithmetic constraints are discharged by the Z3 solver [15].

3 Language and Specifications


3.1 The Target Language 


We define the core language C^t in Fig. 6, which is built based on C syntax and
provides support for timed behavioral patterns.


(Program)      P    ::= (a*, meth*)
(Types)        τ    ::= int | bool | void
(Method)       meth ::= τ mn (τ x)* {req Φ_pre ens Φ_post} {e}
(Values)       v    ::= () | c | b | x
(Assignment)   a    ::= x := t
(Expressions)  e    ::= v | a | [v]e | mn(v*) | e1;e2 | e1||e2 | if v e1 e2 | event[A(v,a*)]
                      | delay[v] | e1 timeout[v] e2 | e deadline[v] | e1 interrupt[v] e2
(Terms)        t    ::= c | x | t1+t2 | t1-t2

c ∈ Z    b ∈ B    mn, x ∈ var    (Action labels) A ∈ Σ


Fig. 6. A core first-order imperative language with timed constructs via implicit clocks.


Here, c and b stand for integer and Boolean constants, mn and x are meta-variables,
drawn from var (the countably infinite set of arbitrary distinct identifiers).
A program P comprises a list of global variable initializations a* and a list
of method declarations meth*. Here, we use the * superscript to denote a finite
list of items, for example, x* refers to a list of variables, x1, ..., xn. Each method
meth has a name mn and an expression-oriented body e, and is associated with a
precondition Φ_pre and a postcondition Φ_post (specification syntax is given in Fig.
7). C^t allows each iterative loop to be optimized to an equivalent tail-recursive
method, where mutation on parameters is made visible to the caller.

Expressions comprise: values v; guarded processes [v]e, where if v is true, it
behaves as e, else it idles until v becomes true; method calls mn(v*); sequential
composition e1; e2; parallel composition e1||e2, where e1 and e2 may communicate
via shared variables; conditionals if v e1 e2; and event raising expressions
event[A(v,a*)] where the event A comes from the finite set of event labels Σ.
Without loss of generality, events can be further parametrized with one value v
and a set of assignments a* to update the mutable variables. Moreover, a number
of timed constructs can be used to capture common real-time system behaviors,
which are explained via operational semantics rules in Sec. 3.2.


3.2 Operational Semantics of C^t


To build the semantics of the system model, we define the notion of a configura- 
tion in Definition 4, to capture the global system state during system execution. 


Definition 4 (System configuration). A system configuration $\zeta$ is a pair
$(S,e)$ where $S$ is a variable valuation function (or a stack) and $e$ is an expression.


A transition of the system is of the form $\zeta \xrightarrow{l} \zeta'$, where
$\zeta$ and $\zeta'$ are the system configurations before and after the
transition respectively. Transition labels $l$ include: $d$, denoting a
non-negative integer; $\tau$, denoting an invisible event; $A$, denoting an
observable event. For example, $\zeta \xrightarrow{d} \zeta'$ denotes that $d$
time units elapse. Next, we present the firing rules associated with timed constructs.

Process delay[v] idles for exactly $v$ time units. Rule [delay1] states that the
process may idle for any amount of time $d$, provided that $d \le v$; rule
[delay2] states that the process terminates immediately when the remaining
bound becomes 0.

\[
\frac{d \le v}{(S, delay[v]) \xrightarrow{d} (S, delay[v-d])}\ [delay_1]
\qquad
\frac{}{(S, delay[0]) \xrightarrow{\tau} (S, ())}\ [delay_2]
\]
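
The two delay rules can be read as a tiny state machine. The following is a minimal sketch in Python, not the authors' OCaml implementation; the names `Delay`, `step_time` and `step_tau` are hypothetical, and the stack S is omitted because neither rule changes it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delay:
    """delay[v]: idle for v more time units (v is a non-negative integer)."""
    v: int


UNIT = "()"  # hypothetical encoding of the terminated process ()


def step_time(p: Delay, d: int) -> Delay:
    """Rule [delay1]: let d time units elapse, allowed only if 0 <= d <= v."""
    if d < 0 or d > p.v:
        raise ValueError(f"cannot idle {d} units in delay[{p.v}]")
    return Delay(p.v - d)


def step_tau(p: Delay) -> str:
    """Rule [delay2]: delay[0] terminates via an invisible (tau) step."""
    if p.v != 0:
        raise ValueError("delay has not expired yet")
    return UNIT


p = Delay(5)
p = step_time(p, 2)   # delay[5] --2--> delay[3]
p = step_time(p, 3)   # delay[3] --3--> delay[0]
result = step_tau(p)  # delay[0] --tau--> ()
```

Splitting the five time units into two elapses reaches the same final state as a single elapse of five, which is the behavior the premise d ≤ v is meant to allow.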


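In $e_1$ timeout[v] $e_2$, the first observable event of $e_1$ shall occur
before $v$ time units have elapsed; otherwise, $e_2$ takes over control after
exactly $v$ time units. Note that the usage of timeout in Fig. 1 is a special
case where $e_1$ never starts by default.

\[
\frac{(S, e_1) \xrightarrow{A} (S', e_1')}{(S, e_1\ timeout[v]\ e_2) \xrightarrow{A} (S', e_1')}\ [to_1]
\qquad
\frac{(S, e_1) \xrightarrow{\tau} (S', e_1')}{(S, e_1\ timeout[v]\ e_2) \xrightarrow{\tau} (S', e_1'\ timeout[v]\ e_2)}\ [to_2]
\]
\[
\frac{(S, e_1) \xrightarrow{d} (S, e_1') \quad (d \le v)}{(S, e_1\ timeout[v]\ e_2) \xrightarrow{d} (S, e_1'\ timeout[v-d]\ e_2)}\ [to_3]
\qquad
\frac{}{(S, e_1\ timeout[0]\ e_2) \xrightarrow{\tau} (S, e_2)}\ [to_4]
\]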


Process deadline[v] $e$ behaves exactly as $e$ except that it must terminate
before $v$ time units have elapsed. The guarded process $[v]e$ behaves as $e$
when $v$ is true; otherwise it idles until $v$ becomes true. Process
$e_1$ interrupt[v] $e_2$ behaves as $e_1$ for $v$ time units, and then $e_2$
takes over. The remaining rules are given in [16].


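\[
\frac{(S, e) \xrightarrow{A/\tau} (S', e')}{(S, deadline[v]\ e) \xrightarrow{A/\tau} (S', deadline[v]\ e')}\ [ddl_1]
\qquad
\frac{(S, e) \xrightarrow{\tau} (S', v)}{(S, deadline[v]\ e) \xrightarrow{\tau} (S', v)}\ [ddl_2]
\]
\[
\frac{S \models (v = true)}{(S, [v]e) \xrightarrow{\tau} (S, e)}\ [gd_1]
\qquad
\frac{(S, e) \xrightarrow{d} (S, e') \quad (d \le v)}{(S, deadline[v]\ e) \xrightarrow{d} (S, deadline[v-d]\ e')}\ [ddl_3]
\]
\[
\frac{S \not\models (v = true)}{(S, [v]e) \xrightarrow{d} (S, [v]e)}\ [gd_2]
\qquad
\frac{(S, e_1) \xrightarrow{A/\tau} (S', e_1')}{(S, e_1\ interrupt[v]\ e_2) \xrightarrow{A/\tau} (S', e_1'\ interrupt[v]\ e_2)}\ [int_1]
\]
\[
\frac{(S, e_1) \xrightarrow{\tau} (S', v)}{(S, e_1\ interrupt[v]\ e_2) \xrightarrow{\tau} (S', v)}\ [int_2]
\qquad
\frac{}{(S, e_1\ interrupt[0]\ e_2) \xrightarrow{\tau} (S, e_2)}\ [int_3]
\]
\[
\frac{(S, e_1) \xrightarrow{d} (S, e_1') \quad (d \le v)}{(S, e_1\ interrupt[v]\ e_2) \xrightarrow{d} (S, e_1'\ interrupt[v-d]\ e_2)}\ [int_4]
\]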


3.3 The Specification Language 


We plant TimEffs specifications into the Hoare-style verification system, using
$\Phi_{pre}$ and $\Phi_{post}$ to capture the temporal pre/post conditions. As
shown in Fig. 7, TimEffs can be constructed by a conditioned event sequence
$\pi \wedge \theta$, or an effects disjunction $\Phi_1 \vee \Phi_2$. Timed
sequences comprise nil ($\bot$); empty trace $\epsilon$; single event $ev$;
concatenation $\theta_1 \cdot \theta_2$; disjunction $\theta_1 \vee \theta_2$;
parallel composition $\theta_1 || \theta_2$; and a block waiting for a certain
constraint to be satisfied, $\pi?\theta$. We introduce a new operator $\#$, and
$\theta\#t$ represents that the trace $\theta$ takes $t$ time units to complete, where $t$
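\[
\begin{array}{lrcl}
\text{(Timed Effects)} & \Phi & ::= & \pi \wedge \theta \mid \Phi_1 \vee \Phi_2 \\
\text{(Event Sequences)} & \theta & ::= & \bot \mid \epsilon \mid ev \mid \theta_1 \cdot \theta_2 \mid \theta_1 \vee \theta_2 \mid \theta_1 || \theta_2 \mid \pi?\theta \mid \theta\#t \mid \theta^\star \\
\text{(Events)} & ev & ::= & A(v,\alpha^*) \mid \tau(\pi) \mid \overline{A} \mid \_ \\
\text{(Pure)} & \pi & ::= & True \mid False \mid bop(t_1,t_2) \mid \pi_1 \wedge \pi_2 \mid \pi_1 \vee \pi_2 \mid \neg\pi \mid \pi_1 \Rightarrow \pi_2 \\
\text{(Real-Time Terms)} & t & ::= & c \mid x \mid t_1+t_2 \mid t_1-t_2
\end{array}
\]
\[ c \in \mathbb{Z} \qquad x \in \mathbf{var} \qquad \text{(Real Time Bound)}\ \# \qquad \text{(Kleene Star)}\ \star \]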

Fig. 7. Syntax of TimEffs. 


is a real-time term. A timed sequence can also be constructed by $\theta^\star$,
representing zero or more repetitions of the trace $\theta$. For single events,
$A(v,\alpha^*)$ stands for an observable event with label $A$, parameterized by
$v$ and the assignment operations $\alpha^*$; $\tau(\pi)$ is an invisible event,
parameterized with a pure formula $\pi$⁵.

Events can also be $\overline{A}$, referring to all events which are not labeled
using $A$; and a wildcard $\_$, which matches all events. We use $\pi$ to denote
a pure formula which captures the (Presburger) arithmetic conditions on terms or
program parameters. We use $bop(t_1,t_2)$ to represent binary atomic formulas of
terms (including $=$, $>$, $<$, $\ge$ and $\le$). Terms consist of constant
integer values $c$; integer variables $x$; and simple computations of terms,
$t_1+t_2$ and $t_1-t_2$.


3.4 Semantic Model of Timed Effects 


Let $d, S, \varphi \models \Phi$ denote the model relation, i.e., a stack $S$
and a concrete execution trace $\varphi$ that takes $d$ time units to complete
together satisfy the specification $\Phi$.


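\[
\begin{array}{lcl}
d,S,\varphi \models \Phi_1 \vee \Phi_2 & \text{iff} & d,S,\varphi \models \Phi_1 \text{ or } d,S,\varphi \models \Phi_2 \\
d,S,\varphi \models \pi \wedge \epsilon & \text{iff} & d=0 \text{ and } [\![\pi]\!]_S = True \text{ and } \varphi = [] \\
d,S,\varphi \models \pi \wedge ev & \text{iff} & d=0 \text{ and } [\![\pi]\!]_S = True \text{ and } \varphi = [ev] \\
d,S,\varphi \models \pi \wedge (\theta_1 \cdot \theta_2) & \text{iff} & \exists \varphi_1,\varphi_2.\ \varphi_1 {+}{+} \varphi_2 = \varphi \text{ and } \exists d_1,d_2.\ d_1+d_2=d \\
 & & \text{s.t. } d_1,S,\varphi_1 \models \pi \wedge \theta_1 \text{ and } d_2,S,\varphi_2 \models \pi \wedge \theta_2 \\
d,S,\varphi \models \pi \wedge (\theta_1 \vee \theta_2) & \text{iff} & d,S,\varphi \models \pi \wedge \theta_1 \text{ or } d,S,\varphi \models \pi \wedge \theta_2 \\
d,S,\varphi \models \pi \wedge (ev_1 \cdot \theta_1) || (ev_2 \cdot \theta_2) & \text{iff} & d,S,\varphi \models \pi \wedge ev_1 \cdot (\theta_1 || (ev_2 \cdot \theta_2)) \text{ or } \\
 & & d,S,\varphi \models \pi \wedge ev_2 \cdot ((ev_1 \cdot \theta_1) || \theta_2) \\
d,S,\varphi \models \pi \wedge (ev \cdot \theta_1) || (ev \cdot \theta_2) & \text{iff} & d,S,\varphi \models \pi \wedge ev \cdot (\theta_1 || \theta_2) \\
d,S,\varphi \models \pi \wedge (\epsilon\#t_1) || (\epsilon\#t_2) & \text{iff} & d,S,\varphi \models (\pi \wedge t_1 \ge t_2) \wedge (\epsilon\#t_1) \cdot (\epsilon\#(t_1-t_2)) \text{ or } \\
 & & d,S,\varphi \models (\pi \wedge t_1 < t_2) \wedge (\epsilon\#t_2) \cdot (\epsilon\#(t_2-t_1)) \\
d,S,\varphi \models \pi \wedge \pi_1?\theta & \text{iff} & [\![\pi_1]\!]_S = True,\ d,S,\varphi \models \pi \wedge \theta \text{ or } \\
 & & [\![\pi_1]\!]_S = False,\ d,S,\varphi \models \pi \wedge \pi_1?\theta \\
d,S,\varphi \models \pi \wedge \theta\#t & \text{iff} & [\![\pi \wedge t \ge 0]\!]_S = True,\ \exists \theta_1,\theta_2.\ \theta_1 \cdot \theta_2 = \theta,\ \text{fresh } t_1,t_2, \text{ s.t.} \\
 & & d,S,\varphi \models (\pi \wedge t_1 \ge 0 \wedge t_2 \ge 0 \wedge t_1+t_2=t) \wedge (\theta_1\#t_1) \cdot (\theta_2\#t_2) \\
d,S,\varphi \models \pi \wedge \theta^\star & \text{iff} & d,S,\varphi \models \pi \wedge \epsilon \text{ or } d,S,\varphi \models \pi \wedge \theta \cdot \theta^\star \\
d,S,\varphi \models false & \text{iff} & [\![\pi]\!]_S = False \text{ or } \varphi = \bot
\end{array}
\]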


Fig. 8. Semantics of TimEffs. 


5 The difference between $\tau(\pi)$ and $\pi?$ is: $\tau(\pi)$ marks an assertion which leads to false
($\bot$) if $\pi$ is not satisfied, whereas $\pi?$ waits until $\pi$ is satisfied.
To define the model, var is the set of program variables, val is the set of
primitive values; and $d$, $S$, $\varphi$ are drawn from the following concrete
domains: $d$: $\mathbb{N}$, $S$: var$\rightarrow$val and $\varphi$: list of
event. As shown in Fig. 8, $++$ appends event sequences; $[]$ describes the
empty sequence; $[ev]$ represents the singleton sequence that contains event
$ev$; $[\![\pi]\!]_S = True$ represents that $\pi$ holds on the stack $S$.
Notice that simple events, i.e., those without $\#$, are taken to happen
instantaneously.


3.5 Expressiveness. TimEffs draw similarities to metric temporal logic (MTL),
which is derived from LTL, where a set of non-negative real numbers is added to
temporal modal operators. As shown in Table 2, we are able to encode MTL
operators into TimEffs, making the specifications more intuitive and readable.
The basic modal operators are: $\Box$ for "globally"; $\Diamond$ for "finally";
$\bigcirc$ for "next"; $\mathcal{U}$ for "until", and their past-time reversed
versions: $\blacksquare$; $\blacklozenge$; and $\ominus$ for "previous";
$\mathcal{S}$ for "since". $I$ in MTL is the time interval with concrete
upper/lower bounds, whereas in TimEffs the bounds can be symbolic and dependent
on program inputs.


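Table 2. Examples for converting MTL formulae into TimEffs, with $t \in I$ applied.


$\Phi_{post}$ | $\Box_I A \equiv (A^\star)\#t$ | $\Diamond_I A \equiv (\_^\star \cdot A)\#t$ | $\bigcirc_I A \equiv (\_)\#t \cdot A$ | $A\ \mathcal{U}_I\ B \equiv (A^\star)\#t \cdot B$
$\Phi_{pre}$ | $\blacksquare_I A \equiv (A^\star)\#t$ | $\blacklozenge_I A \equiv (A \cdot \_^\star)\#t$ | $\ominus_I A \equiv A \cdot ((\_)\#t)$ | $A\ \mathcal{S}_I\ B \equiv B \cdot ((A^\star)\#t)$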


4 Automated Forward Verification 


4.1 Forward Rules 
Forward rules are in the form of Hoare-style triples
$S \vdash \{\Pi,\Theta\}\ e\ \{\Pi',\Theta'\}$, where $S$ is the stack
environment; $\{\Pi,\Theta\}$ and $\{\Pi',\Theta'\}$ are program states, i.e.,
disjunctions of conditioned event sequences $\pi \wedge \theta$. The meaning of
the transition is:
$\{\Pi',\Theta'\} = \bigcup_{i=0}^{|\{\Pi,\Theta\}|-1} \{\Pi_i,\Theta_i\}$ where
$(\pi_i \wedge \theta_i) \in \{\Pi,\Theta\}$ and
$S \vdash \{\pi_i,\theta_i\}\ e\ \{\Pi_i,\Theta_i\}$⁶.
We here present the rules for time-related constructs; the remaining rules are
given in [16]. Rule [FV-Delay] creates a trace $\epsilon\#t$, where $t$ is
fresh, and concatenates it to the current program state, together with the
additional constraint $t=v$. Rule [FV-Deadline] computes the effects from $e$
and adds an upper time bound to the results. Rule [FV-Timeout] computes the
effects from $e_1$ and $e_2$ using the starting state $\{\pi,\epsilon\}$. The
final state is a union of possible effects with corresponding time bounds and
arithmetic constraints. Note that $hd(\Theta_1)$ and $tl(\Theta_1)$ return the
event head (cf. Definition 6) and the tail of $\Theta_1$ respectively.


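\[
\frac{\theta' = \theta \cdot (\epsilon\#t) \quad (t \text{ is fresh})}{S \vdash \{\pi,\theta\}\ delay[v]\ \{\pi \wedge (t=v),\ \theta'\}}\ [FV\text{-}Delay]
\qquad
\frac{S \vdash \{\pi,\epsilon\}\ e\ \{\Pi_1,\Theta_1\} \quad (t \text{ is fresh})}{S \vdash \{\pi,\theta\}\ deadline[v]\ e\ \{\Pi_1 \wedge (t \le v),\ \theta \cdot (\Theta_1\#t)\}}\ [FV\text{-}Deadline]
\]
\[
\frac{\begin{array}{c}
S \vdash \{\pi,\epsilon\}\ e_1\ \{\Pi_1,\Theta_1\} \qquad S \vdash \{\pi,\epsilon\}\ e_2\ \{\Pi_2,\Theta_2\} \qquad (t_1, t_2 \text{ are fresh}) \\
\{\Pi_f,\Theta_f\} = \{\Pi_1 \wedge t_1 < v,\ (hd(\Theta_1)\#t_1) \cdot tl(\Theta_1)\} \cup \{\Pi_2 \wedge t_2 = v,\ (\epsilon\#t_2) \cdot \Theta_2\}
\end{array}}{S \vdash \{\pi,\theta\}\ e_1\ timeout[v]\ e_2\ \{\Pi_f,\ \theta \cdot \Theta_f\}}\ [FV\text{-}Timeout]
\]
\[
\frac{S \vdash \{\pi,\epsilon\}\ e_1\ \{\Pi_1,\Theta_1\} \qquad \Delta = \bigcup_{(\pi_1 \wedge \theta_1) \in \{\Pi_1,\Theta_1\}} \mathcal{N}^{interrupt(v,\pi_1)}(\theta_1, \epsilon) \qquad S \vdash \{\Delta\}\ e_2\ \{\Pi',\Theta'\}}{S \vdash \{\pi,\theta\}\ e_1\ interrupt[v]\ e_2\ \{\Pi',\ \theta \cdot \Theta'\}}\ [FV\text{-}Interrupt]
\]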


6 $|\{\Pi,\Theta\}|$ is the size of $\{\Pi,\Theta\}$, i.e., the count of conditioned event sequences $\pi \wedge \theta$.
[FV-Interrupt] computes the interruption interleaves of $e_1$'s effects, which
come from the over-approximation of all the possibilities. For example, for
trace $A \cdot B$, the interruption with time $t$ creates three possibilities:
$(\epsilon\#t) \vee (A\#t) \vee ((A \cdot B)\#t)$. Then the rule continues to
compute the effects of $e_2$; lastly, it prepends the original history $\theta$
to the final results. Algorithm 1 presents the interleaving algorithm for
interruptions, where $+$ unions program states (cf. Definition 7 and
Definition 8 for the fst and D functions).


Algorithm 1: Interruption Interleaving

Input: $v, \pi, \theta, \theta_{his}$
Output: Program States: $\Delta$
1 function $\mathcal{N}^{interrupt(v,\pi)}(\theta, \theta_{his})$
2   $\Delta \leftarrow [\,]$
3   foreach $f \in fst_\pi(\theta)$ do
4     $\phi \leftarrow \pi \wedge (t < v) \wedge (\theta_{his}\#t)$
5     $\theta' \leftarrow D^\pi_f(\theta)$
6     $\theta_{his} \leftarrow \theta_{his} \cdot f$
7     $\Delta' \leftarrow \mathcal{N}^{interrupt(v,\pi)}(\theta', \theta_{his})$
8     $\Delta \leftarrow \Delta + \phi + \Delta'$
9   return $\Delta$
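
The A·B example above can be reproduced with a few lines of Python. This is only an illustrative sketch for a linear trace of event labels, not Algorithm 1 itself: it ignores pure constraints, disjunction and the derivative machinery, and the helper names `interruption_prefixes` and `show` are hypothetical.

```python
from typing import List, Tuple


def interruption_prefixes(trace: List[str]) -> List[Tuple[str, ...]]:
    """Return every prefix of `trace`, shortest first.

    Each prefix stands for what e1 managed to do before being interrupted;
    a verifier would attach a fresh time bound t to each of them.
    """
    return [tuple(trace[:i]) for i in range(len(trace) + 1)]


def show(prefix: Tuple[str, ...]) -> str:
    """Render a prefix as a timed trace, writing 'eps' for the empty one."""
    body = ".".join(prefix) if prefix else "eps"
    return f"({body})#t"


prefixes = interruption_prefixes(["A", "B"])
rendered = " \\/ ".join(show(p) for p in prefixes)
print(rendered)  # (eps)#t \/ (A)#t \/ (A.B)#t
```

A trace of n events yields n+1 candidate histories, which is why the interleaving is described as an over-approximation of all interruption points.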


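Theorem 1 (Soundness of Forward Rules). Given any system configuration
$\zeta=(S,e)$, by applying the operational semantics rules, if
$(S,e) \rightarrow^* (S',v)$ has execution time $d$ and produces event sequence
$\varphi$; and for any history effect $\pi \wedge \theta$ such that
$d_1,S,\varphi_1 \models (\pi \wedge \theta)$, and the forward verifier reasons
$S \vdash \{\pi,\theta\}\ e\ \{\Pi,\Theta\}$, then
$\exists (\pi' \wedge \theta') \in \{\Pi,\Theta\}$ such that
$(d_1+d), S', (\varphi_1 {+}{+} \varphi) \models (\pi' \wedge \theta')$.
($\zeta \rightarrow^* \zeta'$ denotes the reflexive, transitive closure of
$\zeta \rightarrow \zeta'$.)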


Proof. See the technical report [16]. 


5 Temporal Verification via a TRS 


The TRS is an automated entailment checker to prove language inclusions between
TimEffs. It is triggered prior to function calls for the precondition checking,
and at the end of verifying a function for the postcondition checking.

Given two effects $\Phi_1$ and $\Phi_2$, the TRS decides if the inclusion
$\Phi_1 \sqsubseteq \Phi_2$ is valid. During the effects rewriting process, the
inclusions are in the form of $\Gamma \vdash \Phi_1 \sqsubseteq^{\Phi} \Phi_2$,
a shorthand for $\Gamma \vdash \Phi \cdot \Phi_1 \sqsubseteq \Phi \cdot \Phi_2$.
To prove such an inclusion is to check whether all the possible timed traces in
the antecedent $\Phi_1$ are legitimately allowed in the timed traces described
by the consequent $\Phi_2$. Here $\Gamma$ is the proof context, i.e., a set of
effects inclusion hypotheses; and $\Phi$ is the history effects from the
antecedent that have been used to match the effects from the consequent. The
checking is initially invoked with $\Gamma=\emptyset$ and $\Phi=True \wedge \epsilon$.


Effects Disjunctions. An inclusion with a disjunctive antecedent succeeds if 
both disjunctions entail the consequent. An inclusion with a disjunctive conse- 
quent succeeds if the antecedent entails either of the disjunctions. 


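\[
\frac{\Gamma \vdash \Phi_1 \sqsubseteq \Phi \qquad \Gamma \vdash \Phi_2 \sqsubseteq \Phi}{\Gamma \vdash \Phi_1 \vee \Phi_2 \sqsubseteq \Phi}\ [LHS\text{-}OR]
\qquad
\frac{\Gamma \vdash \Phi \sqsubseteq \Phi_1 \quad \text{or} \quad \Gamma \vdash \Phi \sqsubseteq \Phi_2}{\Gamma \vdash \Phi \sqsubseteq \Phi_1 \vee \Phi_2}\ [RHS\text{-}OR]
\]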


Now, the inclusions are disjunction-free formulas. Next we provide the
definitions and key implementations of the auxiliary functions Nullable, First
and Derivative. Intuitively, the Nullable function $\delta_\pi(\theta)$ returns
a Boolean value indicating whether $\pi \wedge \theta$ contains the empty trace;
the First function $fst_\pi(\theta)$ computes a set of initial heads, denoted as
$h$, of $\pi \wedge \theta$; the Derivative function $D^\pi_h(\theta)$ computes
the next-state effects after eliminating the head $h$ from the current effects
$\pi \wedge \theta$.

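Definition 5 (Nullable⁷). Given any $\Phi = \pi \wedge \theta$,
\[
\delta_\pi(\theta): bool = \begin{cases} true & \text{if } \epsilon \in [\![\pi \wedge \theta]\!] \\ false & \text{if } \epsilon \notin [\![\pi \wedge \theta]\!] \end{cases}
\]
\[
\begin{array}{llll}
\delta_\pi(\bot)=\delta_\pi(ev)=false & \delta_\pi(\epsilon)=\delta_\pi(\theta^\star)=true & \delta_\pi(\pi'?\theta)=\delta_\pi(\theta) & \delta_\pi(\theta_1 \vee \theta_2)=\delta_\pi(\theta_1) \vee \delta_\pi(\theta_2) \\
\delta_\pi(\theta_1 \cdot \theta_2)=\delta_\pi(\theta_1) \wedge \delta_\pi(\theta_2) & \delta_\pi(\theta_1 || \theta_2)=\delta_\pi(\theta_1) \wedge \delta_\pi(\theta_2) & \multicolumn{2}{l}{\delta_\pi(\theta\#t)=SAT(\pi \wedge (t=0)) \wedge \delta_\pi(\theta)}
\end{array}
\]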
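
The Nullable cases translate almost line by line into a recursive function. Below is a minimal Python sketch under two loudly stated assumptions: event sequences are encoded as hypothetical tagged tuples, and every time bound is a concrete non-negative integer, so the solver query SAT(π ∧ (t=0)) collapses to the check `t == 0` (the paper's prototype queries Z3 on symbolic bounds instead).

```python
# Hypothetical tuple encoding of event sequences (not the paper's data type):
#   ("bot",)  ("eps",)  ("ev", label)  ("star", th)
#   ("cat", th1, th2)  ("alt", th1, th2)  ("par", th1, th2)
#   ("guard", pi, th)   for pi?th
#   ("timed", th, t)    for th#t, with t a concrete integer in this sketch


def nullable(theta) -> bool:
    """Return True iff the event sequence can produce the empty trace."""
    tag = theta[0]
    if tag in ("bot", "ev"):
        return False
    if tag in ("eps", "star"):
        return True
    if tag == "guard":
        return nullable(theta[2])
    if tag == "alt":
        return nullable(theta[1]) or nullable(theta[2])
    if tag in ("cat", "par"):
        return nullable(theta[1]) and nullable(theta[2])
    if tag == "timed":
        # Stand-in for SAT(pi /\ (t = 0)): t is concrete here.
        return theta[2] == 0 and nullable(theta[1])
    raise ValueError(f"unknown constructor: {tag}")


A = ("ev", "A")
print(nullable(("star", A)))                     # True
print(nullable(("cat", ("eps",), A)))            # False
print(nullable(("timed", ("star", A), 0)))       # True
print(nullable(("timed", ("star", A), 3)))       # False
```

The last two calls show why the timed case needs the extra conjunct: a trace that must take three time units cannot be empty even if its untimed part is nullable.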


Definition 6 (Heads). If $h$ is a head of $\pi \wedge \theta$, then there exist $\pi'$ and $\theta'$ such that
$\pi \wedge \theta = \pi' \wedge (h \cdot \theta')$. A head can be $t$, denoting a pure time passing; $A(v,\alpha^*)$, denoting
an instant event passing; or $(A(v,\alpha^*),t)$, denoting an event passing which takes time $t$.


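Definition 7 (First). Given any $\Phi = \pi \wedge \theta$, $fst_\pi(\theta)$ returns a set of heads, namely the set of
initial elements derivable from the effects $\pi \wedge \theta$, where ($t'$ is fresh):
\[
\begin{array}{l}
fst_\pi(\bot)=fst_\pi(\epsilon)=\{\} \qquad fst_\pi(A(v,\alpha^*))=\{A(v,\alpha^*)\} \qquad fst_\pi(\epsilon\#t)=\{t\} \qquad fst_\pi(\theta^\star)=fst_\pi(\theta) \\
fst_\pi(\theta\#t)=\{(A(v,\alpha^*),t') \mid A(v,\alpha^*) \in fst_\pi(\theta)\} \qquad fst_\pi(\theta_1 \vee \theta_2)=fst_\pi(\theta_1) \cup fst_\pi(\theta_2) \\
fst_\pi(\pi'?\theta)=fst_\pi(\theta) \qquad fst_\pi(\theta_1 || \theta_2)=fst_\pi(\theta_1) \cup fst_\pi(\theta_2) \\
fst_\pi(\theta_1 \cdot \theta_2)=\begin{cases} fst_\pi(\theta_1) \cup fst_\pi(\theta_2) & \text{if } \delta_\pi(\theta_1)=true \\ fst_\pi(\theta_1) & \text{if } \delta_\pi(\theta_1)=false \end{cases}
\end{array}
\]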


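Definition 8 (TimEffs Partial Derivative). Given any $\Phi = \pi \wedge \theta$, the partial derivative
$D^\pi_h(\theta)$ computes the effects for the left quotient $h^{-1}(\pi \wedge \theta)$, cf. Definition 1.
\[
\begin{array}{l}
D^\pi_h(\bot)=D^\pi_h(\epsilon)=False \wedge \bot \qquad D^\pi_h(A(v,\alpha^*))=(\pi \wedge (h=A(v,\alpha^*))) \wedge \epsilon \qquad D^\pi_h(\theta^\star)=D^\pi_h(\theta) \cdot \theta^\star \\
D^\pi_h(\pi'?\theta)=\begin{cases} \pi \wedge \pi'?\theta & \text{if } \pi \not\Rightarrow \pi' \\ \pi \wedge \theta & \text{if } \pi \Rightarrow \pi' \end{cases}
\qquad
D^\pi_h(\theta_1 \cdot \theta_2)=\begin{cases} D^\pi_h(\theta_1) \cdot \theta_2 \vee D^\pi_h(\theta_2) & \text{if } \delta_\pi(\theta_1)=true \\ D^\pi_h(\theta_1) \cdot \theta_2 & \text{if } \delta_\pi(\theta_1)=false \end{cases} \\
D^\pi_{(A(v,\alpha^*),t)}(\theta)=\bigvee \{D^{\pi'}_t(\theta') \mid (\pi' \wedge \theta') \in D^\pi_{A(v,\alpha^*)}(\theta)\} \\
D^\pi_t(\theta\#t')=(\pi \wedge t+t''=t') \wedge \theta\#t'' \ (t'' \text{ is fresh}) \qquad D^\pi_h(\theta_1 \vee \theta_2)=D^\pi_h(\theta_1) \vee D^\pi_h(\theta_2) \\
D^\pi_{A(v,\alpha^*)}(\theta\#t)=\bigvee \{\pi' \wedge (\theta'\#t) \mid (\pi' \wedge \theta') \in D^\pi_{A(v,\alpha^*)}(\theta)\} \qquad D^\pi_h(\theta_1 || \theta_2)=\bar{D}^\pi_h(\theta_1)\ ||\ \bar{D}^\pi_h(\theta_2)
\end{array}
\]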


Notice that the derivatives of a parallel composition makes use of the Parallel 
= = 0 if DF 0) = (FalseAL 
Derivative D7 (0), defined as follows: Df (0)= jd if Di ees PEEL] 
Di (0) otherwise 
5.1 Rewriting Rules. Given the well-defined auxiliary functions above, we now
discuss the key rewriting rules that are deployed in effects inclusion proofs.


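\[
\frac{}{\Gamma \vdash \pi \wedge \bot \sqsubseteq \Phi}\ [Bot\text{-}LHS]
\qquad
\frac{\theta \neq \bot}{\Gamma \vdash \pi \wedge \theta \not\sqsubseteq \pi' \wedge \bot}\ [Bot\text{-}RHS]
\]
\[
\frac{\delta_{\pi_1}(\theta_1) \wedge \neg\delta_{\pi_1 \wedge \pi_2}(\theta_2)}{\Gamma \vdash \pi_1 \wedge \theta_1 \not\sqsubseteq \pi_2 \wedge \theta_2}\ [DISPROVE]
\qquad
\frac{\pi_1 \Rightarrow \pi_2 \qquad fst_{\pi_1}(\theta_1) = \{\}}{\Gamma \vdash \pi_1 \wedge \theta_1 \sqsubseteq \pi_2 \wedge \theta_2}\ [PROVE]
\]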


7 SAT($\pi$) stands for querying the Z3 theorem prover to check the satisfiability of $\pi$.
Axiom rules [Bot-LHS] and [Bot-RHS] are analogous to standard propositional
logic: $\bot$ (referring to false) entails any effects, while no non-false
effects entail $\bot$. [DISPROVE] is used to disprove an inclusion when the
antecedent is nullable while the consequent is not.

We use two rules to prove an inclusion: (i) [PROVE] is used when the antecedent
has no head; and (ii) [REOCCUR] proves an inclusion when there exist inclusion
hypotheses in the proof context $\Gamma$ which are able to soundly prove the
current goal. [UNFOLD] is the inductive step of unfolding the inclusions. The
proof of the original inclusion succeeds if all the derivative inclusions succeed.


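\[
\frac{(\pi_1 \wedge \theta_1 \sqsubseteq \pi_3 \wedge \theta_3) \in \Gamma \qquad (\pi_3 \wedge \theta_3 \sqsubseteq \pi_4 \wedge \theta_4) \in \Gamma \qquad (\pi_4 \wedge \theta_4 \sqsubseteq \pi_2 \wedge \theta_2) \in \Gamma}{\Gamma \vdash \pi_1 \wedge \theta_1 \sqsubseteq \pi_2 \wedge \theta_2}\ [REOCCUR]
\]
\[
\frac{H = fst_{\pi_1}(\theta_1) \qquad \Gamma' = \Gamma, (\pi_1 \wedge \theta_1 \sqsubseteq \pi_2 \wedge \theta_2) \qquad \forall h \in H.\ (\Gamma' \vdash D^{\pi_1}_h(\theta_1) \sqsubseteq D^{\pi_2}_h(\theta_2))}{\Gamma \vdash \pi_1 \wedge \theta_1 \sqsubseteq \pi_2 \wedge \theta_2}\ [UNFOLD]
\]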
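
To make the interplay of the four rules concrete, here is a minimal Python sketch of the same proof loop for plain, untimed regular expressions. It is an assumption-laden simplification, not the paper's TRS: there are no pure constraints, time bounds, parallel composition or solver calls, derivatives are Brzozowski-style, the hypothesis lookup is an exact match, and all names (`cat`, `alt`, `deriv`, `included`) are hypothetical.

```python
BOT, EPS = ("bot",), ("eps",)


def ev(a): return ("ev", a)
def star(r): return ("star", r)


def cat(r, s):
    if r == BOT or s == BOT: return BOT
    if r == EPS: return s
    if s == EPS: return r
    return ("cat", r, s)


def alt(r, s):
    # Keep disjuncts in a frozenset so that \/ is associative, commutative
    # and idempotent; this keeps the set of reachable derivatives small.
    parts = set()
    for x in (r, s):
        if x != BOT:
            parts |= x[1] if x[0] == "alt" else {x}
    if not parts: return BOT
    return next(iter(parts)) if len(parts) == 1 else ("alt", frozenset(parts))


def nullable(r):
    tag = r[0]
    if tag in ("eps", "star"): return True
    if tag in ("bot", "ev"): return False
    if tag == "cat": return nullable(r[1]) and nullable(r[2])
    return any(nullable(x) for x in r[1])            # alt


def first(r):
    tag = r[0]
    if tag in ("bot", "eps"): return set()
    if tag == "ev": return {r[1]}
    if tag == "star": return first(r[1])
    if tag == "cat":
        return first(r[1]) | (first(r[2]) if nullable(r[1]) else set())
    return set().union(*(first(x) for x in r[1]))    # alt


def deriv(h, r):
    tag = r[0]
    if tag in ("bot", "eps"): return BOT
    if tag == "ev": return EPS if r[1] == h else BOT
    if tag == "star": return cat(deriv(h, r[1]), r)
    if tag == "cat":
        left = cat(deriv(h, r[1]), r[2])
        return alt(left, deriv(h, r[2])) if nullable(r[1]) else left
    out = BOT
    for x in r[1]:                                   # alt
        out = alt(out, deriv(h, x))
    return out


def included(r, s, ctx=frozenset()):
    if (r, s) in ctx: return True                        # [REOCCUR]
    if nullable(r) and not nullable(s): return False     # [DISPROVE]
    heads = first(r)
    if not heads: return True                            # [PROVE]
    ctx = ctx | {(r, s)}                                 # [UNFOLD]
    return all(included(deriv(h, r), deriv(h, s), ctx) for h in heads)


A, B = ev("A"), ev("B")
print(included(star(cat(A, B)), star(alt(A, B))))  # True
print(included(star(A), cat(A, star(A))))          # False
```

In the first query the goal reappears after consuming A and then B, so the proof closes through the hypothesis set rather than by unfolding forever; in the second, the antecedent accepts the empty trace while the consequent does not, so the check fails immediately.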


Theorem 2 (Termination of the TRS). The TRS is terminating. 


Proof. See the technical report [16]. 


Theorem 3 (Soundness of the TRS). Given an inclusion $\Phi_1 \sqsubseteq \Phi_2$, if the
TRS returns TRUE with a proof, then $\Phi_1 \sqsubseteq \Phi_2$ is valid.


Proof. See the technical report [16]. 


6 Implementation and Evaluation 


To show the feasibility, we prototype our automated verification system using 
OCaml (~5k LOC); and prove soundness for both the forward verifier and the 
TRS. We set up two experiments to evaluate our implementation: i) function- 
ality validation via verifying symbolic timed programs; and ii) comparison with 
PAT [17] and Uppaal [3] using real-life Fischer’s mutual exclusion algorithm. Ex- 
periments are done on a MacBook with a 2.6 GHz 6-Core Intel i7 processor. The 
source code and the evaluation benchmark are openly accessible from [18]. 


6.1 Experimental Results for Symbolic Timed Models. We manually annotate
TimEffs specifications for a set of synthetic examples (about 54 programs) to
test the main contributions, including: computing effects from symbolic timed
programs written in $C^t$; and the inclusion checking for TimEffs with the
parallel composition, block waiting operator and shared global variables.

Table 3 presents the evaluation results for another 16 $C^t$ programs⁸, and the
annotated temporal specifications are in a 1:1 ratio for succeeded/failed cases.
The table records: No., index of the program; LOC, lines of code; Forward(ms),
effects computation time; #Prop(✓), number of valid properties; Avg-Prove(ms),
average proving time for the valid properties; #Prop(✗), number of invalid
properties; Avg-Dis(ms), average disproving time for the invalid properties;
#AskZ3, number of Z3 queries issued throughout the experiments.


8 All programs contain timed constructs, conditionals, and parallel compositions. 
Table 3. Experimental Results for Manually Constructed Synthetic Examples.

No. | LOC | Forward(ms) | #Prop(✓) | Avg-Prove(ms) | #Prop(✗) | Avg-Dis(ms) | #AskZ3
1   | 26  | 0.006       | 5        | 52.379        | 5        | 21.31       | 77
2   | 37  | 43.955      | 5        | 83.374        | 5        | 52.165      | 188
3   | 44  | 32.654      | 5        | 52.524        | 5        | 33.444      | 104
4   | 72  | 202.181     | 5        | 82.922        | 5        | 55.971      | 229
5   | 98  | 42.706      | 7        | 149.345       | 7        | 60.325      | 396
6   | 134 | 403.617     | 7        | 160.932       | 7        | 292.304     | 940
7   | 133 | 51.492      | 7        | 17.901        | 7        | 47.643      | 118
8   | 173 | 57.114      | 7        | 40.772        | 7        | 30.977      | 128
9   | 182 | 872.995     | 9        | 252.123       | 9        | 113.838     | 1142
10  | 210 | 546.222     | 9        | 146.341       | 9        | 57.832      | 570
11  | 240 | 643.133     | 9        | 146.268       | 9        | 69.245      | 608
12  | 260 | 1032.31     | 9        | 242.699       | 9        | 123.054     | 928
13  | 265 | 12558.05    | 11       | 150.999       | 11       | 117.288     | 2465
14  | 286 | 12257.834   | 11       | 501.994       | 11       | 257.800     | 3090
15  | 287 | 1383.034    | 11       | 546.064       | 11       | 407.952     | 1489
16  | 337 | 49873.835   | 11       | 1863.901      | 11       | 954.996     | 15505


Observations: i) the proving/disproving time tends to increase when the effect
computation time increases, because a larger Forward(ms) indicates higher
complexity w.r.t. the timed constructs, which complicates the inclusion
checking; ii) as the number of Z3 queries per property
(#AskZ3/(#Prop(✓)+#Prop(✗))) goes up, the proving/disproving time goes up.
Besides, we notice that iii) the disproving times for invalid properties are
lower than the proving times for all but two of the programs (No. 6 and 7),
largely regardless of the program's complexity, which is as expected in a TRS.
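
As a quick worked check of the ratio used in observation ii), the snippet below evaluates #AskZ3/(#Prop(✓)+#Prop(✗)) for the first and last rows of Table 3. The figures are copied from the table; the function name `queries_per_property` is only illustrative.

```python
def queries_per_property(valid: int, invalid: int, ask_z3: int) -> float:
    """Number of Z3 queries divided by the total number of properties."""
    return ask_z3 / (valid + invalid)


# (#Prop valid, #Prop invalid, #AskZ3) for programs 1 and 16 of Table 3.
row_1 = (5, 5, 77)
row_16 = (11, 11, 15505)

ratio_1 = queries_per_property(*row_1)    # 77 / 10
ratio_16 = queries_per_property(*row_16)  # 15505 / 22

print(round(ratio_1, 2), round(ratio_16, 2))  # 7.7 704.77
```

Program 16 issues roughly ninety times more queries per property than program 1, in line with its much larger average proving and disproving times.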

6.2 Verifying Fischer's mutual exclusion algorithm. As shown in Table 4, the
data in columns PAT(s) and Uppaal(s) are drawn from prior work [19], and
indicate the time to prove Fischer's mutual exclusion w.r.t. the number of
processes (#Proc) in PAT and Uppaal respectively. For our system, based on the
implementation presented in Fig. 5, we are able to prove the mutual exclusion
properties given the arithmetic constraint $d<e$. Besides, the system disproves
mutual exclusion when $d \ge e$. We record the proving (Prove(s)) and disproving
(Disprove(s)) times and their numbers of unique Z3 queries (#AskZ3-u).


Table 4. Comparison with PAT via verifying Fischer’s mutual exclusion algorithm 


#Proc | Prove(s) | #AskZ3-u | Disprove(s) | #AskZ3-u | PAT(s) | Uppaal(s)
2     | 0.09     | 31       | 0.110       | 37       | <0.05  | <0.09
3     | 0.21     | 35       | 0.093       | 42       | <0.05  | <0.09
4     | 0.46     | 63       | 0.120       | 47       | 0.05   | 0.09
5     | 25.0     | 84       | 0.128       | 52       | 0.15   | 0.19


Observations: i) automata-based model checkers (both PAT and Uppaal) are
highly efficient when given concrete values for the constants d and e; however,
ii) our proposal is able to symbolically prove the algorithm by only providing
the constraints on d and e, which cannot be achieved by existing model
checkers; iii) our verification time largely depends on the number of Z3
queries, which is reduced in our implementation by keeping a table of already
queried constraints.
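
The table of already queried constraints mentioned in observation iii) can be pictured as a small memo layer in front of the solver. The sketch below is a guess at the general shape rather than the prototype's code: `QueryCache` is a hypothetical name, constraints are plain strings, and `toy_solver` is a stand-in for a real SMT call.

```python
from typing import Callable, Dict


class QueryCache:
    """Remember solver answers so that each distinct constraint is solved once."""

    def __init__(self, solve: Callable[[str], bool]) -> None:
        self._solve = solve
        self._table: Dict[str, bool] = {}
        self.solver_calls = 0  # counts only the queries that reach the solver

    def is_sat(self, constraint: str) -> bool:
        key = " ".join(constraint.split())  # normalise whitespace
        if key not in self._table:
            self.solver_calls += 1
            self._table[key] = self._solve(key)
        return self._table[key]


def toy_solver(constraint: str) -> bool:
    """Stand-in for an SMT query: only the literal 'False' is unsatisfiable."""
    return constraint != "False"


cache = QueryCache(toy_solver)
answers = [cache.is_sat("d < e"), cache.is_sat("d  <  e"), cache.is_sat("False")]
print(answers, cache.solver_calls)  # [True, True, False] 2
```

Three lookups reach the solver only twice, which mirrors the distinction the table draws between all queries and unique queries (#AskZ3-u).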
6.3 Case Study: Prove it when Reoccur. Termination of the TRS is guaranteed
because the set of derivatives to be considered is finite, and possible cycles
are detected using memoization [14], as demonstrated in Table 5. In step ②, in
order to eliminate the first event B, A*#tR has to be reduced to $\epsilon$;
therefore the RHS time constraint has been strengthened to tR=0. Looking at the
sub-tree (I), in step ⑤, tL and tR are split into tL¹+tL² and tR¹+tR². Then in
step ⑥, A#tL¹ together with A#tR¹ are eliminated, unifying tL¹ and tR¹ by adding
the side constraint tL¹=tR¹. In step ⑧, we observe that the proposition is
isomorphic with one of the previous steps, marked using (†). Hence we apply the
rule [REOCCUR] to prove it, with a successful side-constraint entailment.


Table 5. The reoccurrence proving example. (I) is the left hand side sub-tree of the
main rewriting proof tree.


Main tree (premises above each rule line, conclusion below):

                                       True ∧ ε ⊑ tR=0 ∧ ε              ④ [PROVE]
                                       -------------------------------- ③ [Normal]
                                       True ∧ B ⊑ tR=0 ∧ ε·B
                                       -------------------------------- ② [UNFOLD]
(I) tL<3 ∧ (A*#tL)·B ⊑ tR<4 ∧ (A*#tR)·B     True ∧ B ⊑ tR<4 ∧ (A*#tR)·B
------------------------------------------------------------------------ ① [OR-LHS]
(tL<3 ∧ (A*#tL)·B) ∨ (True ∧ B) ⊑ tR<4 ∧ (A*#tR)·B

Sub-tree (I), listed from its root upwards:

root:        tL<3 ∧ (A*#tL)·B ⊑ tR<4 ∧ (A*#tR)·B   (†)
⑤ [SPLIT]    tL¹+tL²=tL ∧ tR¹+tR²=tR
⑥ [UNFOLD]   π_u: tL¹=tR¹
⑦ [UNFOLD]
⑧ [REOCCUR]  tL<3 ∧ (A*#tL²)·B ⊑ tR<4 ∧ (A*#tR²)·B, closed against (†)


6.4 Discussion. Our implementation is the first that proves the inclusion of
symbolic TAs, which is considered significant because it overcomes the following
main limitations of traditional timed model checking: i) TAs cannot be used to
specify/verify incompletely specified systems (i.e., those whose timing
constants are not yet known) and hence cannot be used in early design phases;
ii) verifying a system with a set of timing constants usually requires
enumerating all of them if they are supposed to be integer-valued; iii) TAs
cannot be used to verify systems whose timing constants are taken from a
real-valued dense interval.


7 Related Work 


7.1 Verification Framework. This work draws the most similarities to [20], 
which also deploys a forward verifier and a TRS for extended regular expressions. 
The differences are: i) [20] targets general-purpose sequential programs without 
shared variables, whereas this work targets time-critical programs with the pres- 
ence of concurrency and global shared states; ii) the dependent values in [20] 
denote the number of repetitions of a trace, whereas in this work, they abstract 
the real-time bounds; iii) in this work, the TRS supports inclusion checking for
the block waiting operator $\pi?$ and the concurrent composition $||$. These
are essential in timed verification (or, more generally, for distributed
systems), and are not supported in [20] or any other TRS-related works.
7.2 Specifications and Real-Time Verification. Apart from compositional
modelling for real-time systems based on timed-process algebras, such as Timed
CSP [8] and CCS+Time [21], there have been a number of translation-based
approaches to building verification support for timed-process algebras. For
example, in [8], Timed CSP is translated to Timed Automata (TAs) so that the
model checker Uppaal [3] can be applied. On the other hand, all the
translation-based approaches share a common problem: the overhead introduced by
the complex translation makes them particularly inefficient when disproving
properties. We are of the opinion that the goal of verifying real-time systems,
in particular safety-critical systems, is to check logical temporal properties,
which can be done without constructing the whole reachability graph or using
the full power of model checking. We consider our approach simpler, as it is
based directly on constraint-solving techniques, and it can be fairly efficient
in verifying systems consisting of many components, as it avoids exploring the
whole state space [20,22].

This work draws similarities to Real-Time Maude [23], which complements 
timed automata with more expressive object-oriented specifications. 


7.3 Clock Manipulation and Zone-based Bisimulation. The concept of implicit
clocks has also been used in time Petri nets, and implemented in several model
checking engines, e.g., [24]. On the other hand, to make model checking more
efficient with explicit clocks, [25,26,27,28] work on dynamically deleting or
merging clocks. Our work also draws connections with region/zone-based
bisimulations [29], which are broadly used in reasoning about timed automata.


8 Conclusion 


This work provides an alternative approach for verifying real-time systems, where 
temporal behaviors are reasoned at the source level, and the specification expres- 
siveness goes beyond traditional Timed Automata. We define the novel effects 
logic TimEffs, to capture real-time behavioral patterns and temporal properties. 
We demonstrate how to build axiomatic semantics (or rather an effects system) 
for $C^t$ via timed-trace processing functions. We use this semantic model to enable
a Hoare-style forward verifier, which computes the program effects constructively. 
We present an effects inclusion checker — the TRS — to efficiently prove the an- 
notated temporal properties. We prototype the verification system and show its 
feasibility. To the best of our knowledge, our work proposes the first algebraic 
TRS for solving inclusion relations between timed specifications. 


Limitations And Future Work. Our TRS is incomplete, meaning there exist 
valid inclusions which will be disproved in our system. That is mainly because 
of insufficient unification in favour of achieving automation. We also foresee the 
possibilities of adding other logics into our existing trace-based temporal logic, 
such as separation logic for verifying heap-manipulating distributed programs. 
We consider parameterized verification of systems executing according to the
total store ordering (TSO) semantics. The processes manipulate abstract data 
types over potentially infinite domains. We present a framework that translates 
the reachability problem for such systems to the reachability problem for register 
machines enriched with the given abstract data type. We use the translation to 
obtain tight complexity bounds for TSO-based parameterized verification over 
several abstract data types, such as push-down automata, ordered multi push- 
down automata, one-counter nets, one-counter automata, and Petri nets. We 
apply the framework to get complexity bounds for higher order stack and counter 
variants as well. 


1 Introduction 


A parameterized system consists of a fixed but arbitrary number of identical pro- 
cesses that execute in parallel. The goal of parameterized verification is to prove 
the correctness of the system regardless of the number of processes. Examples 
for such systems are sensor networks, leader election protocols, and mutual ex- 
clusion protocols. The topic has been the subject of intensive research for more 
than three decades (see e.g. [10,32,13,6]), and it is the subject of one chapter of 
the Handbook of Model Checking [8]. Research on parameterized verification has 
been mostly conducted under the premise that (i) the processes run according 
to the classical Sequential Consistency (SC) semantics, and (ii) the processes are 
finite-state machines. 

Under SC, the processes operate on a set of shared variables through which 
they communicate atomically, i.e., read and write operations take effect immedi- 
ately. In particular, a write operation is visible to all the processes as soon as the 
writing process carries out its operation. Therefore, the processes always main- 
tain a uniform view of the shared memory: they all see the latest value written 
on any given variable, hence we can interpret program runs as interleavings of 
sequential process executions. Although SC has been immensely popular as an 
intuitive way of understanding the behaviours of concurrent processes, it is not 
realistic to assume computation platforms guarantee SC anymore. The reason 
is that, due to hardware and compiler optimizations, most modern platforms 
allow more relaxed program behaviours than those permitted under SC, leading
to so-called weak memory models. Weakly consistent platforms are found at all
levels of system design such as multiprocessor architectures (e.g., [48,47]), cache
protocols (e.g., [46,21]), language level concurrency (e.g., [41]), and distributed
data stores (e.g., [17]). Therefore, in recent years, research on the parameterized
verification of concurrent programs under weak memory models has started to
become popular. Notable examples are the cases of the TSO semantics [4] and
the Release-Acquire semantics of C11 [39].

In a parallel development, several works have extended the basic model of
parameterized systems (under the SC semantics) by considering processes that are
infinite-state systems. The most dominant such class has been the case where the
individual processes are variants of push-down automata [36,33,28,40,42,30].

Parameterized verification is difficult, even under the original assumption of 
both SC and finite-state processes as we still need to handle an infinite state 
space. The extension to weakly consistent systems is even more complex due to 
the intricate extra process behaviours. Almost all weak memory models induce 
infinite state spaces even without parameterization and even when the program 
itself is finite-state. Therefore, performing parameterized verification under weak 
consistency requires handling a state space that is infinite in two dimensions; one 
due to parameterization and one due to the weak memory model. The same ap- 
plies to the extension of parameterized verification under SC where the processes 
are infinite-state: in addition to infiniteness due to parameterization, we have a 
second source of infinity due to the infiniteness of the processes. 

In this paper, we combine the above two extensions. We study parameter- 
ized verification of programs under the TSO semantics, where the processes use 
infinite data structures such as stacks and counters. The framework is uniform 
in that the manipulation can be described using an abstract data type. 

We revisit the pivot abstraction technique presented in [4]. As a first contri- 
bution, we show that we can capture pivot abstraction precisely, using a class 
of register machines in which the registers assume values over a finite domain. 
We show that, for any given abstract data type A, we can reduce, in polynomial 
time, the parameterized verification problem under TSO and A to the reach- 
ability problem for register machines manipulating A. Furthermore, we show 
that the reduction also holds in the other direction: the reachability problem 
for register machines over A is polynomial-time reducible to the parameterized 
verification problem under TSO for A. In particular, the model abstracts away 
the semantics of TSO (in fact, it abstracts away concurrency altogether) since 
we are dealing with a single register machine. 

We summarize the contributions of the paper as follows: 


— We present a register abstraction scheme that captures the behaviour of 
parameterized systems under the TSO semantics. 

— We translate parameterized verification under the TSO semantics when the 
processes manipulate an ADT A, to the reachability problem for register 
machines operating over A. 

— We instantiate the framework for deciding the complexity of parameterized 
verification under TSO for different abstract data types. In particular we 
show the problem is PSPACE-complete when A is a one-counter, EXPTIME-complete
if A is a stack, 2-ETIME-complete if A is an ordered multi stack,
and EXPSPACE-complete if A is a Petri net. We obtain further complexity
bounds for higher order counters and stacks.


Related Work There has been an extensive research effort on parameterized 
verification since the 1980s (see [13,8] for recent surveys of the field). Early works 
showed the undecidability of the general problem (even assuming finite-state 
processes) [10], and hence the emphasis has been on finding useful special cases. 
Such cases are characterized by three aspects, namely the system topology (un- 
ordered, arrays, trees, graphs, rings, etc.), the allowed communication patterns 
(shared memory, Rendez-vous, broadcast, lossy channels, etc.), and the process 
types (anonymous, with IDs, with priorities, etc.) [27,20,31,24,23,43]. 

Another line of research counters undecidability with over-approximations 
based on regular model checking [38,14,16,1], monotonic abstraction [5], and 
symmetry reduction [37,22,7]. 

A seminal work in the area is the paper by German and Sistla [32]. The 
authors consider the verification of systems consisting of an arbitrary number 
of finite-state processes interacting through Rendez-Vous communication. The 
paper shows that the model checking problem is EXPSPACE-complete. In a series 
of more recent papers, parameterized verification has been considered in the case 
where the individual processes are push-down automata [36,33,28,40,42,30]. All 
the above works assume the SC semantics. 

Due to the relevance of weak memory models in parameterized verification, 
papers on the topic have started to appear in the last two years. The paper 
[4] considers parameterized verification of programs running under TSO, and 
shows that the reachability problem is PSPACE-complete. However, the paper 
assumes that the processes are finite-state and, in particular, the processes do 
not manipulate unbounded data domains. The model of the paper corresponds 
to the particular case of our framework where we take the abstract data type to 
be empty. In this case our framework also implies PSPACE-completeness. 

The paper [39] shows PSPACE-completeness when the underlying semantics is 
the Release-Acquire fragment of C11. The latter gives rise to different 
behaviours compared to TSO. This paper, too, considers only finite-state processes. 

The paper [2] considers parameterized verification of programs running un- 
der TSO. However, the paper applies the framework of well-structured systems 
where the buffers of the processes are modeled as lossy channels, and hence the 
complexity of the algorithm is non-primitive recursive. In particular, the paper 
does not give any complexity bounds for the reachability problem (or any other 
verification problems). Conchon et al. [19] also address the parameterized verification 
of programs under TSO. They make use of Model Checker Modulo Theories, but 
give no decidability or complexity results. The paper [15] considers 
checking the robustness property against SC for parameterized systems running 
under the TSO semantics. However, the robustness problem is entirely different 
from reachability and the techniques and results developed in the paper cannot 
be applied in our setting. The paper shows that the problem is EXPSPACE-hard. 
All these works assume finite-state processes. 

In contrast to all the above works, the current paper is the first paper that 
studies decidability and complexity of parameterized verification under the TSO 
semantics when the individual processes are infinite-state. 


2 Preliminaries 


We denote a function $f$ between sets $A$ and $B$ by $f : A \to B$. We write $f[a \leftarrow b]$ 
to denote the function $f'$ such that $f'(a) = b$ and $f'(x) = f(x)$ for all $x \neq a$. 

For a finite set $A$, we use $|A|$ to refer to the size of $A$. We also use $A^*$ to 
denote the set of words over $A$ including the empty word $\varepsilon$. For a word $w \in A^*$, 
we use $|w|$ to refer to the length of $w$. We say a word $w$ is differentiated if all 
symbols in $w$ are pairwise different. The set $A^{\mathit{diff}}$ is the set of all differentiated 
words over the set $A$. Finally, for a differentiated word $w$, we define $\mathit{pos}(w)(a)$ 
as the unique position of the letter $a$ in $w$. 
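
The notation above can be made concrete with a small sketch. The helper names `update`, `is_differentiated` and `pos` are illustrative only, and positions are assumed to be 1-based, which matches the later use of ranks starting at 1.

```python
def update(f, a, b):
    """Return f[a <- b]: a copy of the mapping f that sends a to b."""
    g = dict(f)
    g[a] = b
    return g


def is_differentiated(w):
    """A word is differentiated if all of its symbols are pairwise different."""
    return len(set(w)) == len(w)


def pos(w, a):
    """Position of the letter a in the differentiated word w (1-based, an assumption)."""
    assert is_differentiated(w)
    return w.index(a) + 1
```

For instance, `pos("abc", "b")` evaluates to 2, while `is_differentiated("aba")` is false because the letter a occurs twice.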

A labelled transition system is a tuple $(C, C_{\mathit{init}}, \mathit{Labs}, \rightarrow)$, where $C$ is the set 
of configurations, $C_{\mathit{init}} \subseteq C$ is the set of initial configurations, $\mathit{Labs}$ is a finite 
set of labels and $\rightarrow \subseteq C \times \mathit{Labs} \times C$ is the transition relation over the set of 
configurations. For a transition $(c_1, \mathit{lab}, c_2) \in \rightarrow$, we usually write $c_1 \xrightarrow{\mathit{lab}} c_2$ 
instead. We use $c_1 \rightarrow c_2$ to denote that $c_1 \xrightarrow{\mathit{lab}} c_2$ for some $\mathit{lab} \in \mathit{Labs}$. Furthermore, 
we write $\xrightarrow{*}$ to denote the transitive reflexive closure of $\rightarrow$, and if 
$c_1 \xrightarrow{*} c_2$ then we say $c_2$ is reachable from $c_1$. If $c_1 \in C_{\mathit{init}}$, then we just say that 
$c_2$ is reachable. A run $\rho$ is an alternating sequence of configurations and labels 
and is expressed as follows: $c_0 \xrightarrow{\mathit{lab}_1} c_1 \xrightarrow{\mathit{lab}_2} c_2 \ldots c_{n-1} \xrightarrow{\mathit{lab}_n} c_n$. Given $\rho$, we 
write $c_0 \xrightarrow{n} c_n$ meaning that $c_n$ is reachable from $c_0$ by $n$ steps, and we write 
$c_0 \xrightarrow{\rho} c_n$ meaning that $c_n$ is reachable from $c_0$ through the run $\rho$. 
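
For a finite labelled transition system, reachability as defined above can be computed by a standard breadth-first search. The following is a minimal sketch, assuming the transition relation is given explicitly as a set of triples; the function name `reachable` is illustrative.

```python
from collections import deque


def reachable(initial, transitions):
    """Return all configurations reachable from the set `initial`.

    `transitions` is an iterable of triples (c1, lab, c2).
    """
    succ = {}
    for c1, _lab, c2 in transitions:
        succ.setdefault(c1, set()).add(c2)
    seen = set(initial)
    queue = deque(seen)
    while queue:
        c = queue.popleft()
        for c2 in succ.get(c, ()):
            if c2 not in seen:
                seen.add(c2)
                queue.append(c2)
    return seen
```

The systems studied in the paper are in general infinite-state, so this search only illustrates the definition and is not the decision procedure developed later.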


3 Abstract Data Types (ADT) 


In this section, we introduce the notion of abstract data types (ADTs) which 
will be used extensively in the paper. An ADT is a labelled transition system 
$A = (\mathit{Vals}, \{\mathit{val}_{\mathit{init}}\}, \mathit{Ops}, \rightarrow_A)$. Intuitively, this describes the behaviour of some 
data type such as a stack or a counter. $\mathit{Vals}$ is the set of configurations of $A$. It 
describes the possible values the data type can assume. The initial configuration 
is $\mathit{val}_{\mathit{init}} \in \mathit{Vals}$. The set of labels $\mathit{Ops}$ represents the operations that can be 
executed on the data type and the transition relation $\rightarrow_A \subseteq \mathit{Vals} \times \mathit{Ops} \times \mathit{Vals}$ 
describes the semantics of these operations. Below, we give some concrete 
examples of abstract data types. 


Example 1 (Counter). We define a counter, denoted by the ADT CT, as follows. 
The set of configurations $\mathit{Vals}^{\mathrm{CT}} = \mathbb{N}$ is the set of natural numbers. The initial value, 
denoted by $\mathit{val}^{\mathrm{CT}}_{\mathit{init}}$, is 0. The set of operations is $\mathit{Ops}^{\mathrm{CT}} = \{\mathit{inc}, \mathit{dec}, \mathit{isZero}\}$. The 
transition relation $\rightarrow_{\mathrm{CT}}$ is as follows: The operations inc and dec increase or 
decrease the value of the counter by one, respectively. The latter operation is 
only enabled if the value of the counter is non-zero, otherwise it blocks. Finally, 
the transition isZero checks that the value of the counter is zero, i.e. it is only 
enabled if that condition is true. 


Example 2 (Weak Counter). A weak counter differs from a counter in that it cannot 
be checked for zero. The ADT WCT representing a weak counter is defined as 
in Example 1, except that the operations of WCT are reduced to $\mathit{Ops}^{\mathrm{WCT}} = \{\mathit{inc}, \mathit{dec}\}$. 


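Example 3 (Stack). Let $\Gamma$ be a finite set representing the stack alphabet. A stack 
$\mathrm{ST} = (\mathit{Vals}^{\mathrm{ST}}, \{\mathit{val}^{\mathrm{ST}}_{\mathit{init}}\}, \mathit{Ops}^{\mathrm{ST}}, \rightarrow_{\mathrm{ST}})$ on $\Gamma$ is defined as follows. The configurations 
of ST are $\mathit{Vals}^{\mathrm{ST}} = \Gamma^*$ and the initial configuration is the empty stack $\mathit{val}^{\mathrm{ST}}_{\mathit{init}} = \varepsilon$. 
The set of operations is $\mathit{Ops}^{\mathrm{ST}} = \{\mathit{pop}(\gamma), \mathit{push}(\gamma), \mathit{isEmpty} \mid \gamma \in \Gamma\}$. The 
transition relation is as follows. For every word $w \in \Gamma^*$ and every symbol $\gamma \in \Gamma$, 
$\mathit{push}(\gamma)$ adds the symbol $\gamma$ to the top of the stack. Similarly, the $\mathit{pop}(\gamma)$ 
operation removes the topmost symbol from the stack. It is only enabled if $\gamma$ is the 
topmost symbol on the stack. The isEmpty operation does not change the stack, 
but can only be performed if the stack is the empty word $\varepsilon$. 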
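
Examples 1 to 3 can be read as partial step functions on values. The sketch below encodes them that way, returning `None` when an operation is not enabled; the function names and the tuple encoding of stack operations are assumptions made for illustration.

```python
def counter_step(val, op):
    """Transition relation of CT: successor value, or None if op blocks."""
    if op == "inc":
        return val + 1
    if op == "dec":
        return val - 1 if val > 0 else None
    if op == "isZero":
        return val if val == 0 else None
    raise ValueError(f"unknown counter operation: {op}")


def weak_counter_step(val, op):
    """Transition relation of WCT: a counter without the zero test."""
    if op == "isZero":
        raise ValueError("a weak counter has no isZero operation")
    return counter_step(val, op)


def stack_step(val, op):
    """Transition relation of ST on tuples; the last entry is the top.

    Operations are ("push", g), ("pop", g) or ("isEmpty",).
    """
    if op[0] == "push":
        return val + (op[1],)
    if op[0] == "pop":
        return val[:-1] if val and val[-1] == op[1] else None
    if op[0] == "isEmpty":
        return val if val == () else None
    raise ValueError(f"unknown stack operation: {op}")
```

Here `counter_step(0, "dec")` returns `None` because dec blocks on zero, and `stack_step(("a",), ("pop", "b"))` returns `None` because b is not the topmost symbol.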


Example 4 (Petri Nets). Given a Petri net [44], we can define a corresponding 
ADT PETRI that models its semantics. The values are the markings, the oper- 
ations are the Petri net transitions and the transition relation is given by the 
input and output vectors of the Petri net transitions. 


Higher Order ADTs We extend the ADT ST to higher order stacks referred 
to as n-ST. This is done recursively [18,25]. The formal definition is in the full 
version of our paper [3]. A value of a level $n$ higher order stack n-ST is a stack 
of level $n-1$ stacks. For level 1, it is the standard stack ST. The operations 
for level $n$ are $\mathit{Ops}^{n\text{-}\mathrm{ST}} = \{\mathit{pop}(\gamma), \mathit{push}(\gamma), \mathit{pop}_k, \mathit{push}_k \mid \gamma \in \Gamma, 2 \le k \le n\}$. 
The operations $\mathit{pop}(\gamma)$ and $\mathit{push}(\gamma)$ are recursively applied to the top element 
in the stack (which consists of a stack that is one level lower) until the level of 
the top element is 1. Here, they have the standard stack behaviour. Operations 
$\mathit{pop}_k$ and $\mathit{push}_k$ are recursively applied to the top element until the level of the 
element is $k$. Then, $\mathit{push}_k$ pushes a copy of this level $k$ stack on top of the original, 
and $\mathit{pop}_k$ removes it. 
Since a counter can be seen as a stack with an alphabet of size 1 (and a bottom 
element $\bot$), we can extend the definitions of WCT and CT to n-WCT and n-CT in the 
same way. We add operations $\mathit{inc}_k, \mathit{dec}_k$. All operations are recursively applied 
to the top counter. For inc, dec, isZero, we use the standard behaviour once the 
level is 1. For $\mathit{inc}_k$, $\mathit{dec}_k$, we copy/remove the top element once the level is $k$. 


Example 5 (Ordered Multi Stack). We extend the stack to a numbered list of 
$n$ many stacks n-OMST [12]. A value of n-OMST consists of a list of stacks 
$\mathit{val}^{\mathrm{ST}}_1 \ldots \mathit{val}^{\mathrm{ST}}_n$. An operation in $\mathit{Ops}^{n\text{-}\mathrm{OMST}} = \{\mathit{isZero}_i, \mathit{pop}_i(\gamma), \mathit{push}_i(\gamma) \mid \gamma \in \Gamma, i \le n\}$ 
works on stack number $i$ in the standard way. One additional condition 
is that the stacks have to be ordered, meaning an operation $\mathit{pop}_i(\gamma)$ is only 
enabled if the stacks $1 \ldots i-1$ are empty. 


4 TSO with an Abstract Data Type: TSO(A) 


In this section, we introduce concurrent programs running under TSO(A) for 
an ADT $A = (\mathit{Vals}, \{\mathit{val}_{\mathit{init}}\}, \mathit{Ops}, \rightarrow_A)$. These programs consist of concurrent 
processes where the communication between processes is performed using shared 
memory under the TSO semantics. In addition, each process maintains a local 
variable of type A. 

Syntax of TSO(A). Let $\mathit{Dom}$ be a finite data domain and $\mathit{Vars}$ be a finite set of 
shared variables over $\mathit{Dom}$. Let $d_{\mathit{init}} \in \mathit{Dom}$ be the initial value of the variables. 
We define the instruction set of TSO(A) as $\mathit{Instrs} = \{\mathit{rd}(x,d), \mathit{wr}(x,d) \mid x \in \mathit{Vars}, d \in \mathit{Dom}\} \cup \{\mathit{skip}, \mathit{mf}\}$, 
whose elements are called read, write, skip and memory 
fence, respectively. 

A process is represented by a finite state transition system. It is given by 
the tuple $\mathit{Proc} = (Q, q_{\mathit{init}}, \delta)$, where $Q$ is a finite set of states, $q_{\mathit{init}} \in Q$ is the 
initial state, and $\delta \subseteq Q \times (\mathit{Instrs} \cup \mathit{Ops}) \times Q$ is the transition relation. We call 
this tuple the description of the process. A concurrent program is a tuple of 
processes $P = (\mathit{Proc}_\iota)_{\iota \in \mathcal{I}}$, where $\mathcal{I}$ is some finite set of process identifiers. For 
each $\iota \in \mathcal{I}$ we have $\mathit{Proc}_\iota = (Q^\iota, q^\iota_{\mathit{init}}, \delta^\iota)$. 

Semantics of TSO(A). We describe the semantics of a program $P$ running 
under TSO(A) by a labelled transition system $T_P = (C^P, C^P_{\mathit{init}}, \mathit{Labs}^P, \rightarrow_P)$. The 
formal definition is given in [3]. Under TSO(A), there is an unbounded FIFO 
buffer of writes between each process and the memory. A configuration $c \in C^P$ 
of the system consists of the value of each variable in the shared memory as well 
as, for each process, its local state, its value of the ADT, and the content of the 
corresponding write buffer. 

The labelled transitions $\rightarrow_P$ are as follows: A local skip transition simply 
updates the state of the corresponding process. An ADT operation additionally 
updates the ADT value according to the ADT behaviour $\rightarrow_A$. When a process executes 
a write instruction, the operation is enqueued as a pending write message 
into its buffer. A message msg is an assignment of the form $\mathit{msg} = (x, d)$, where 
$x \in \mathit{Vars}$ and $d \in \mathit{Dom}$. We denote the set of all messages by $\mathit{Msgs} = \mathit{Vars} \times \mathit{Dom}$. 
The buffer content for a process is given as a word over Msgs. The messages inside 
each buffer are moved non-deterministically to the main memory in a FIFO 
manner. Once a message reaches the memory, it becomes visible to all the other 
processes. When executing a read instruction on a variable $x \in \mathit{Vars}$, the process 
first checks its buffer for pending write messages on x. If the buffer contains such 
a message, then it reads the value of the most recent one. If the buffer contains 
no write messages on x, then the process fetches the value of x from the memory. 
The initial configuration is $c^P_{\mathit{init}}$, where each process is in its initial state, each 
ADT holds its initial value, each store buffer is empty and the memory holds 
the initial values of all variables. Note that since the FIFO buffers are unbounded, this 
is an infinite-state transition system, even for a finite ADT. 
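
The buffer mechanics described above can be sketched as follows. The class and method names are hypothetical, and since the formal semantics is deferred to [3], the treatment of the memory fence (enabled only when the process's buffer is empty) is an assumption based on the usual TSO definition.

```python
class TsoSystem:
    """Shared memory plus one unbounded FIFO write buffer per process."""

    def __init__(self, variables, d_init, n_procs):
        self.memory = {x: d_init for x in variables}
        self.buffers = [[] for _ in range(n_procs)]  # oldest message first

    def write(self, proc, x, d):
        # wr(x, d): enqueue the message (x, d) in the process's own buffer.
        self.buffers[proc].append((x, d))

    def visible_value(self, proc, x):
        # Value a read on x can observe: the most recent pending write on x
        # in the own buffer if there is one, otherwise the memory value.
        for y, d in reversed(self.buffers[proc]):
            if y == x:
                return d
        return self.memory[x]

    def propagate(self, proc):
        # Move the oldest pending message of `proc` to the memory.
        x, d = self.buffers[proc].pop(0)
        self.memory[x] = d

    def can_fence(self, proc):
        # Assumption: mf is enabled only if the buffer of `proc` is empty.
        return not self.buffers[proc]
```

With two processes that write x and y respectively and then read the other variable before any propagation, both reads observe the initial value, which is the classic behaviour that distinguishes TSO from SC.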


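A sequence of transitions $c_0 \xrightarrow{\mathit{lab}_1}_P c_1 \xrightarrow{\mathit{lab}_2}_P c_2 \ldots \xrightarrow{\mathit{lab}_n}_P c_n$, where 
$c_0 = c^P_{\mathit{init}}$ is the initial configuration and $\mathit{lab}_i \in \mathit{Labs}^P$, is called a run in the 
TSO(A) transition system. If there is a run ending in a configuration with state 
$q_{\mathit{final}}$, then we say $q_{\mathit{final}}$ is reachable by Proc under TSO(A). 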


5 Parameterized Reachability in TSO(A) 


In this section, we consider the parameterized TSO setting which allows for 
an a priori unbounded number of processes with the same process description. 
We begin by formally introducing the parameterized state reachability problem, 
and then develop a generic construction that allows us to represent the TSO 
semantics (except for the ADT) in a finite manner. 


The Parameterized State Reachability Problem Intuitively, parameterization al- 
lows for an arbitrary number of identical processes. The parameterized state 
reachability problem for TSO(A) called TSO(A)-P-Reach identifies a family of 
(standard) reachability problem instances. We want to determine whether we 
have reachability in some member of the family. We now introduce this formally. 

For a given process description Proc, we consider the program instance $P^n_{\mathit{Proc}}$, 
parameterized by a natural number $n$, as follows. For $\mathcal{I} = \{1, \ldots, n\}$, let $P^n_{\mathit{Proc}} = 
(\mathit{Proc}_1, \ldots, \mathit{Proc}_n)$ with $\mathit{Proc}_\iota = \mathit{Proc}$ for all $\iota \in \mathcal{I}$. That is, the $n$th slice of 
the parameterized family of programs contains $n$ processes, all with identical 
descriptions Proc. We require that all processes maintain copies of the ADT A. 


TSO(A)-P-Reach: 
Given: A process $\mathit{Proc} = (Q, q_{\mathit{init}}, \delta)$, an ADT $A$, and a state $q_{\mathit{final}} \in Q$. 
Decide: Is there an $n \in \mathbb{N}$ s.t. $q_{\mathit{final}}$ is reachable by $P^n_{\mathit{Proc}}$ under TSO(A)? 


When talking about a certain family of ADTs, e.g. the family of Petri nets, 
we write TSO(PETRI)-P-Reach and mean the restriction of TSO(A)-P-Reach to 
Petri nets, i.e. to instances where A is a Petri net. 

The main difference between the non-parameterized case and the parameterized 
case of the problem is that in the first case the index set $\mathcal{I}$ is a priori fixed, 
while in the second case it can be arbitrary. This results in $C_{\mathit{init}}$ being a singleton 
in the non-parameterized case while it becomes infinite (one initial configuration for each 
n-slice) in the parameterized case. 

We determine upper and lower bounds for the complexity of the state reacha- 
bility problem. The challenge of solving this problem varies with the ADT. This 
problem for plain TSO without an ADT has been studied in [4]. They showed 
that the problem can be decided in PSPACE and is in fact PSPACE-complete. The 
result is based on an abstraction technique called the pivot semantics. The pivot 
semantics is exact in the sense that a state q is reachable under parameterized 
TSO if and only if it is reachable under the pivot semantics. 

We show that the dynamics underlying the pivot abstraction can be generalized 
to our model with ADT. We show that the pivot abstraction can be 
extended to obtain a register machine. We use this construction to give a general 
characterization of TSO(A)-P-Reach. First, we recall the pivot abstraction. 

The Pivot Abstraction [4]. For a set of variables Vars and data domain Dom, 
processes generate pending write messages from the set $\mathit{Msgs} = \mathit{Vars} \times \mathit{Dom}$ by 
executing wr instructions. This set has size $|\mathit{Vars}| \cdot |\mathit{Dom}|$ and hence at most as 
many distinct (variable, value) pairs can be produced in any run. For a run $\rho$ of 
the program, for each message $\mathit{msg} = (x, d) \in \mathit{Msgs}$ we can define the first point 
along $\rho$ at which some write on variable x with value d is propagated to the 
memory. The pivot abstraction identifies these points as pivot points pvt(msg), 
one for each distinct message in Msgs. For a write message msg under $\rho$, the pivot 
point pvt(msg) is the first point of propagation of msg to the memory under $\rho$. 

The core observation is that if at some point in $\rho$, a process $\mathit{Proc}_\iota$ propagates 
a message msg = (x,d) from its buffer to the memory, then after that point, 
the value d will always be available to read on variable x from the shared memory. 
Technically, this follows from parameterization. There are arbitrarily many 
processes executing identical descriptions. This means transitions of the original 
process $\mathit{Proc}_\iota$ can be mimicked by a clone process $\mathit{Proc}_{\iota'}$ identical to $\mathit{Proc}_\iota$. 
Hence, $\mathit{Proc}_{\iota'}$ can replicate the execution of $\mathit{Proc}_\iota$ right up to the point where 
the message msg is the oldest message in its buffer. Then a single propagate 
step updates the value of x in the shared memory to d. There can be arbitrarily 
many such clones and the propagate step can happen at any time. It follows that 
beyond the pvt(msg) point in $\rho$, the value d can always be read from x. 

For distinct messages from Msgs, we can order the pivot points corresponding 
to these messages according to the order in which they appear in $\rho$. This gives 
us a first update sequence, denoted by $w$. No two messages in $w$ are the same; 
the set of such sequences is the set of differentiated words $\mathit{Msgs}^{\mathit{diff}}$. A message 
$\mathit{msg} \in \mathit{Msgs}$ in $w$ has the rank $k$ if it is the $k$-th pivot point in $w$. 
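
As an illustration of pivot points and ranks, the sketch below extracts the first update sequence from the list of messages in the order in which a run propagates them to the memory. The function name and the list-based encoding of the run are assumptions.

```python
def first_update_sequence(propagations):
    """Return (w, rank) for a list of propagated messages (x, d).

    w keeps only the first propagation of each message (its pivot point),
    and rank maps each message in w to its 1-based position.
    """
    w = []
    rank = {}
    for msg in propagations:
        if msg not in rank:
            w.append(msg)
            rank[msg] = len(w)
    return w, rank
```

For the propagation order (x,1), (y,2), (x,1), (x,3), (y,2), the first update sequence is (x,1), (y,2), (x,3), so (x,3) has rank 3 even though it is the fourth propagation.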

Providers. The pivot abstraction simulates a run $\rho$ under the TSO semantics by 
running abstract processes called providers in a sequential manner. For $1 \le k \le |w| + 1$, 
the k-provider simulates the process that generates the write of the rank 
k message (x, d) corresponding to the k-pivot in $\rho$. The k-provider completes its 
task when it has simulated this process until the point it generates (x, d). At this 
point, it invokes the (k+1)-provider. With this background, we now develop the 
formal pivot semantics for parameterized TSO(A). 


Formal Pivot semantics for Parameterized TSO(A). We define the formal operational 
semantics of the pivot abstraction as a labelled transition system. Given 
a process description $\mathit{Proc} = (Q, q_{\mathit{init}}, \delta)$ and ADT $A = (\mathit{Vals}, \{\mathit{val}_{\mathit{init}}\}, \mathit{Ops}, \rightarrow_A)$, 
a configuration of the pivot transition system represents the view of a provider 
when simulating a run of the program. A view $v = (q, \mathit{val}, \mathit{Lw}, w, \phi_E, \phi_L, \phi_P)$ 
is defined as follows. The process state is given by $q \in Q$. The value of the 
provider's ADT A is $\mathit{val} \in \mathit{Vals}$. The function $\mathit{Lw} : \mathit{Vars} \to \mathit{Dom} \cup \{\oslash\}$ gives, for 
each $x \in \mathit{Vars}$, the value of the latest (i.e., most recent) write the provider has 
performed on x. If no such instruction exists (the process has made no writes to 
x) then $\mathit{Lw}(x) = \oslash$. Note that Lw abstracts the buffer in terms of read-own-write 
operations since the process can only read from the most recent pending write 
in its buffer on each variable (if it exists). We define $\mathit{Lw}_\oslash$ such that $\mathit{Lw}_\oslash(x) = \oslash$ 
for all $x \in \mathit{Vars}$. The first update sequence of pivot messages is $w \in \mathit{Msgs}^{\mathit{diff}}$. It 
is unchanged by transitions and remains constant throughout the pivot run. 

The external pointer $\phi_E \in \{0, 1, \ldots, |w|\}$ helps the provider keep track of 
which messages from $w$ it has observed. These messages have been propagated by 
other processes. The external pointer is used to identify which variables are still 
holding their initial values in the memory. If the provider observes an external 
write on a variable x (by accessing the memory), then this write has overwritten 
the initial value of x in the memory. The local pointer $\phi_L : \mathit{Vars} \to \{0, 1, \ldots, |w|\}$ 
is a set of pointers, one for each variable $x \in \mathit{Vars}$. The function $\phi_L(x)$ gives the 
highest ranked write operation the provider itself has performed (on any variable) 
before it performed the latest write on x. The local pointer is necessary to know 
which variables lose their initial values when we need to empty the buffer. In 
other words, the local pointer abstracts the buffer in terms of update operations. 
We define $\phi_L^{\mathit{max}} := \max\{\phi_L(x) \mid x \in \mathit{Vars}\}$ as the highest value of a local pointer 
and $\phi_L^{\oslash}$ such that $\phi_L^{\oslash}(x) = 0$ for all variables $x \in \mathit{Vars}$, i.e., the pointers are all 
in the leftmost position. The progress pointer $\phi_P \in \{1, 2, \ldots, |w| + 1\}$ gives the 
rank of the process the current provider is simulating. 


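\[
\textsf{skip}\quad
\frac{(q, \mathit{skip}, q') \in \delta}
     {(q, \mathit{val}, \mathit{Lw}, w, \phi_E, \phi_L, \phi_P) \xrightarrow{\mathit{skip}}_{\mathit{pvt}} (q', \mathit{val}, \mathit{Lw}, w, \phi_E, \phi_L, \phi_P)}
\]

\[
\textsf{write(1)}\quad
\frac{(q, \mathit{wr}(x,d), q') \in \delta,\quad \mathit{pos}(w)((x,d)) < \phi_P,\quad \phi_L' = \phi_L[x \leftarrow \max(\mathit{pos}(w)((x,d)), \phi_L^{\mathit{max}})]}
     {(q, \mathit{val}, \mathit{Lw}, w, \phi_E, \phi_L, \phi_P) \xrightarrow{\mathit{wr}(x,d)}_{\mathit{pvt}} (q', \mathit{val}, \mathit{Lw}[x \leftarrow d], w, \phi_E, \phi_L', \phi_P)}
\]

\[
\textsf{write(2)}\quad
\frac{(q, \mathit{wr}(x,d), q') \in \delta,\quad \mathit{pos}(w)((x,d)) = \phi_P}
     {(q, \mathit{val}, \mathit{Lw}, w, \phi_E, \phi_L, \phi_P) \xrightarrow{\mathit{wr}(x,d)}_{\mathit{pvt}} v_{\mathit{init}}(w, \phi_P + 1)}
\]

\[
\textsf{read(1)}\quad
\frac{(q, \mathit{rd}(x,d), q') \in \delta,\quad \mathit{Lw}(x) = d}
     {(q, \mathit{val}, \mathit{Lw}, w, \phi_E, \phi_L, \phi_P) \xrightarrow{\mathit{rd}(x,d)}_{\mathit{pvt}} (q', \mathit{val}, \mathit{Lw}, w, \phi_E, \phi_L, \phi_P)}
\]

\[
\textsf{read(2)}\quad
\frac{(q, \mathit{rd}(x,d), q') \in \delta,\quad d = d_{\mathit{init}},\quad \mathit{Lw}(x) = \oslash,\quad \mathit{pos}(w)(x) > \phi_E}
     {(q, \mathit{val}, \mathit{Lw}, w, \phi_E, \phi_L, \phi_P) \xrightarrow{\mathit{rd}(x,d)}_{\mathit{pvt}} (q', \mathit{val}, \mathit{Lw}, w, \phi_E, \phi_L, \phi_P)}
\]

\[
\textsf{read(3)}\quad
\frac{(q, \mathit{rd}(x,d), q') \in \delta,\quad \mathit{pos}(w)((x,d)) < \phi_P,\quad \phi_E' = \max(\phi_E, \phi_L(x), \mathit{pos}(w)((x,d)))}
     {(q, \mathit{val}, \mathit{Lw}, w, \phi_E, \phi_L, \phi_P) \xrightarrow{\mathit{rd}(x,d)}_{\mathit{pvt}} (q', \mathit{val}, \mathit{Lw}, w, \phi_E', \phi_L, \phi_P)}
\]

\[
\textsf{memory-fence}\quad
\frac{(q, \mathit{mf}, q') \in \delta,\quad \phi_E' = \max(\phi_E, \phi_L^{\mathit{max}})}
     {(q, \mathit{val}, \mathit{Lw}, w, \phi_E, \phi_L, \phi_P) \xrightarrow{\mathit{mf}}_{\mathit{pvt}} (q', \mathit{val}, \mathit{Lw}, w, \phi_E', \phi_L, \phi_P)}
\]

\[
\textsf{data-operation}\quad
\frac{(q, \mathit{op}, q') \in \delta,\quad \mathit{op} \in \mathit{Ops},\quad \mathit{val} \xrightarrow{\mathit{op}}_A \mathit{val}'}
     {(q, \mathit{val}, \mathit{Lw}, w, \phi_E, \phi_L, \phi_P) \xrightarrow{\mathit{op}}_{\mathit{pvt}} (q', \mathit{val}', \mathit{Lw}, w, \phi_E, \phi_L, \phi_P)}
\]

Fig. 1: The transition relation of the pivot semantics for a process Proc. 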


Given an update sequence $w \in \mathit{Msgs}^{\mathit{diff}}$ and $1 \le k \le |w| + 1$, we define 
the initial view induced by $w$ and $k$, denoted by $v_{\mathit{init}}(w, k)$, as the view 
$(q_{\mathit{init}}, \mathit{val}_{\mathit{init}}, \mathit{Lw}_\oslash, w, 0, \phi_L^{\oslash}, k)$. For a given $w$, the k-provider starts with $v_{\mathit{init}}(w, k)$: 
$\mathit{Lw}_\oslash$ and $\phi_L^{\oslash}$ imply that the simulated process has not performed any writes and 
$\phi_E = 0$ means that it has not read/updated from/to the memory. 

We define the labelled transition relation $\rightarrow_{\mathit{pvt}}$ on the set of views by the 
inference rules given in Figure 1. The set of labels is $\mathit{Instrs} \cup \mathit{Ops}$. We describe 
the inference rules briefly. The skip rule only changes the local state of the 
process. There are two inference rules, write(1) and write(2), to describe the 
execution of a write operation wr(x,d). The rule write(1) describes the situation 
when the rank of (x,d) is strictly smaller than the progress pointer $\phi_P$. In this 
case, we update both $\mathit{Lw}$ and $\phi_L$. The rule write(2) describes the situation when 
the rank of (x,d) equals the progress pointer. This means that the provider has 
provided the message (x,d) with rank $\phi_P$. Hence it has completed its mission, 
and initiates the next provider by transitioning to $v_{\mathit{init}}(w, \phi_P + 1)$. 

There are three inference rules that describe a read operation rd(x,d). The 
rule read(1) describes the case when the last value written to x by the provider is d, 
i.e. $\mathit{Lw}(x) = d$. In this case, the provider simply reads from its local buffer. The 
rule read(2) describes the read of an initial value. It ensures that the read is 
possible by checking that no write operation on x has been executed by the provider 
($\mathit{Lw}(x) = \oslash$), and by checking that the initial value of the variable has not been 
overwritten in the memory. This is achieved by checking if the position of (x, d) in 
$w$, i.e. $\mathit{pos}(w)((x,d))$, is strictly larger than $\phi_E$. The rule read(3) describes the case when 
the simulated process reads from the memory. It checks that the message (x, d) 
has been generated by some previous provider ($\mathit{pos}(w)((x,d)) < \phi_P$), and then 
it updates the external pointer to $\max(\phi_E, \phi_L(x), \mathit{pos}(w)((x,d)))$. The memory 
fence rule describes the case when the simulated process does a fence action. The rule 
updates the external pointer to $\max(\phi_E, \phi_L^{\mathit{max}})$. Finally, the data-operation rule 
describes the case when the simulated process does an ADT operation. 
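
The rules of Figure 1 can be read as a successor function on views. The following is a sketch under several assumptions that are not fixed by the text: views are Python tuples with dictionaries for Lw and the local pointer, `None` plays the role of the "no write yet" marker, messages that do not occur in the update sequence get rank infinity so that they never pass a rank check, and the rank of a variable in read(2) is taken to be the rank of the first message on that variable. All names are illustrative.

```python
import math

UNWRITTEN = None  # stands for the "no write yet" marker of Lw


def msg_rank(w, msg):
    """pos(w)(msg), 1-based; infinity if msg is not in w (assumption)."""
    return w.index(msg) + 1 if msg in w else math.inf


def var_rank(w, x):
    """Rank of the first message on x in w (assumption); infinity if none."""
    return next((i + 1 for i, (y, _) in enumerate(w) if y == x), math.inf)


def initial_view(w, k, q_init, val_init, variables):
    """Initial view for w and k: no writes, nothing observed, progress k."""
    return (q_init, val_init, {x: UNWRITTEN for x in variables},
            w, 0, {x: 0 for x in variables}, k)


def step(view, instr, q2, d_init, q_init, val_init, adt_step):
    """Successor views of `view` for a transition (q, instr, q2) of Proc."""
    q, val, Lw, w, phiE, phiL, phiP = view
    phiL_max = max(phiL.values(), default=0)
    kind = instr[0]
    if kind == "skip":
        return [(q2, val, Lw, w, phiE, phiL, phiP)]
    if kind == "mf":  # memory-fence
        return [(q2, val, Lw, w, max(phiE, phiL_max), phiL, phiP)]
    if kind == "op":  # data-operation; adt_step returns None if blocked
        val2 = adt_step(val, instr[1])
        return [] if val2 is None else [(q2, val2, Lw, w, phiE, phiL, phiP)]
    _, x, d = instr
    r = msg_rank(w, (x, d))
    if kind == "wr":
        if r < phiP:  # write(1)
            Lw2 = {**Lw, x: d}
            phiL2 = {**phiL, x: max(r, phiL_max)}
            return [(q2, val, Lw2, w, phiE, phiL2, phiP)]
        if r == phiP:  # write(2): hand over to the next provider
            return [initial_view(w, phiP + 1, q_init, val_init, list(Lw))]
        return []
    out = []  # kind == "rd"
    if Lw[x] == d:  # read(1): read own write
        out.append((q2, val, Lw, w, phiE, phiL, phiP))
    if d == d_init and Lw[x] is UNWRITTEN and var_rank(w, x) > phiE:  # read(2)
        out.append((q2, val, Lw, w, phiE, phiL, phiP))
    if r < phiP:  # read(3): read a message provided earlier
        out.append((q2, val, Lw, w, max(phiE, phiL[x], r), phiL, phiP))
    return out
```

For the update sequence (x,1), (y,1) and a provider with progress pointer 2, reading (x,1) from the memory moves the external pointer to 1, after which the initial value of x can no longer be read, while writing (y,1) completes the provider and starts the next one.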

The set of initial views is $V_{\mathit{init}} = \{v_{\mathit{init}}(w, 1) \mid w \in \mathit{Msgs}^{\mathit{diff}}\}$. This is the set of 
initial views of the 1-provider and it is finite because $\mathit{Msgs}^{\mathit{diff}}$ is finite, unlike the 
set of initial configurations $C_{\mathit{init}}$ in the parameterized case under TSO. 
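
To see why the set of initial views is finite but large, one can enumerate the differentiated words over a small message set. This is a sketch with an illustrative function name; it simply lists all orderings of all subsets of the messages.

```python
from itertools import permutations


def differentiated_words(msgs):
    """Yield every differentiated word over msgs, including the empty word."""
    msgs = list(msgs)
    for k in range(len(msgs) + 1):
        yield from permutations(msgs, k)
```

For three messages this yields 1 + 3 + 6 + 6 = 16 words, and the count grows faster than exponentially in the number of messages, which is why the register machine of the next section guesses a single update sequence instead of enumerating them.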


6 Register Machines 


Our goal is to design a general method to determine the decidability and complexity 
of TSO(A)-P-Reach depending on A. We examine the pivot abstraction 
introduced in the previous section. A view $v = (q, \mathit{val}, \mathit{Lw}, w, \phi_E, \phi_L, \phi_P)$ of the 
pivot transition system can be partitioned into the following two components: 
(1) $q, \mathit{Lw}, w, \phi_E, \phi_L, \phi_P$, which contains the local state and also effectively abstracts 
the unbounded FIFO buffers and shared memory of the TSO system, and 
(2) $\mathit{val}$, which captures the value of the ADT. The first part is finite since each 
component takes finitely many values. We call this the book-keeping state since 
it keeps track of the progress of the core TSO system. However, the ADT part 
can be infinite, depending upon the abstract data type. 

We will use a register machine in order to represent the book-keeping state 
in a finite way using states and registers. On the other hand, we will keep the 
ADT component general and only later instantiate it to some interesting cases. 

A register machine is a finite state automaton that has access to a finite set of 
registers, each holding a natural number. The register machine can execute two 
operations on a register: it can write a given value or it can read a given value. 
A read blocks if the given value is not in the register. We differ from most 
definitions of register machines in two significant ways. First, since we only require a 
finite domain to model the TSO(A) semantics, the values of the registers are bounded 
from above by some $N \in \mathbb{N}$. This makes the set of register assignments finite, whereas 
most definitions allow for an unbounded domain. Second, our register machine 
is augmented with an ADT. 

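Given an ADT $A = (\mathit{Vals}, \{\mathit{val}_{\mathit{init}}\}, \mathit{Ops}, \rightarrow_A)$, let $\mathit{Regs}$ be a finite set of 
registers and $\mathit{Dom} = \{0, \ldots, N\}$ their domain. We define the set of actions 
$\mathit{Acts} = \{\mathrm{SKP}, \mathrm{WRITE}(r,d), \mathrm{READ}(r,d) \mid r \in \mathit{Regs}, d \in \mathit{Dom}\}$. A register machine 
is then defined as a tuple $R(A) = (Q, q_{\mathit{init}}, \delta)$, where $Q$ is a finite set of states, 
$q_{\mathit{init}} \in Q$ is the initial state and $\delta \subseteq Q \times (\mathit{Acts} \cup \mathit{Ops}) \times Q$ is the transition relation. 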

The semantics of the register machine is given in terms of a transition 
system. The set of configurations is $Q \times \mathit{Dom}^{\mathit{Regs}} \times \mathit{Vals}$. A configuration consists 
of a state, a register assignment $\mathit{Regs} \to \mathit{Dom}$ and a value of A. The initial 
configuration is $(q_{\mathit{init}}, 0^{\mathit{Regs}}, \mathit{val}_{\mathit{init}})$, where all registers contain the value 0. 

The transition relation $\rightarrow$ is described in the following. SKP only changes 
the local state, not the registers or the ADT value. WRITE(r, d) sets the value of 
the register r to d. READ(r,d) is only enabled if the value of r is d; it does not 
change the value. The operations in Ops work as usual; they do not change 
any register. We define the state reachability problem for register machines, 
R(A)-Reach, in the usual way: a state $q_{\mathit{final}} \in Q$ is reachable if there is a run of 
the transition system defined by the semantics of R(A) that starts in the initial 
configuration and ends in a configuration with state $q_{\mathit{final}}$. 
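
The register machine semantics above can be explored directly when the ADT contributes only finitely many reachable values. The sketch below is a plain breadth-first search over configurations; the encoding of actions as tuples, the name `rm_reachable`, and the `adt_step` callback (returning `None` for a disabled operation) are assumptions, and the search is not one of the complexity-optimal procedures that the paper derives for specific ADTs.

```python
from collections import deque


def rm_reachable(q_init, q_final, delta, regs, val_init, adt_step):
    """Decide state reachability for a register machine by exhaustive search.

    delta: iterable of (q, action, q2) with actions ("SKP",),
    ("WRITE", r, d), ("READ", r, d) or ("OP", op).
    Terminates only if finitely many ADT values are reachable (assumption).
    """
    idx = {r: i for i, r in enumerate(regs)}
    start = (q_init, tuple(0 for _ in regs), val_init)
    seen, queue = {start}, deque([start])
    while queue:
        q, rv, val = queue.popleft()
        if q == q_final:
            return True
        for p, act, q2 in delta:
            if p != q:
                continue
            rv2, val2 = rv, val
            if act[0] == "WRITE":
                i = idx[act[1]]
                rv2 = rv[:i] + (act[2],) + rv[i + 1:]
            elif act[0] == "READ":
                if rv[idx[act[1]]] != act[2]:
                    continue  # blocking read
            elif act[0] == "OP":
                val2 = adt_step(val, act[1])
                if val2 is None:
                    continue  # ADT operation not enabled
            nxt = (q2, rv2, val2)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

A machine that writes 2 to a register and then reads 2 reaches its target, whereas a transition guarded by a read of 1 from a register that still holds 0 is never taken.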


6.1 Simulating Pivot Abstraction by Register Machines 


In this section we will show how to simulate the pivot abstraction by a register 
machine. The idea is to save the book-keeping state (except for the local state) 
in the registers. Given a process description $\mathit{Proc} = (Q^{\mathit{Proc}}, q^{\mathit{Proc}}_{\mathit{init}}, \delta^{\mathit{Proc}})$ for an 
ADT A, we construct a register machine $R(A) = (Q, q_{\mathit{init}}, \delta)$ that simulates the 
pivot semantics as follows. The set of registers is 

\[
\mathit{Regs} := \{\mathit{Lw}(x), \mathit{rk}_{\mathit{Vars}}(x), \mathit{rk}_{\mathit{Msgs}}(\mathit{msg}), \phi_E, \phi_L(x), \phi_L^{\mathit{max}}, \phi_P, \mathit{rk}_{\mathit{nxt}} \mid x \in \mathit{Vars}, \mathit{msg} \in \mathit{Msgs}\}.
\]

The registers $\mathit{rk}_{\mathit{Vars}}(x)$ and $\mathit{rk}_{\mathit{Msgs}}(\mathit{msg})$ hold the rank of each variable and message, 
respectively. This implicitly gives rise to an update sequence. The auxiliary 
register $\mathit{rk}_{\mathit{nxt}}$ is used to initialize the other rank registers, as will be explained 
later on. The remaining registers correspond to their respective counterparts 
in the pivot abstraction. Note that the number of registers is linear in 
the number of messages $|\mathit{Msgs}|$. The domain of the registers is defined to be 
$\mathit{Dom} = \{0, \ldots, |\mathit{Msgs}| + 1\}$. Since the TSO memory domain is finite, we can 
assume w.l.o.g. that the memory values are positive integers. If $\mathit{Lw}(x) = 0$, it 
means that there has been no write on x and it still holds the initial value. The 
set of states Q contains QP°* U {qR (A), qi} as well as a number of (unnamed) 
auxiliary states that will be used in the following. 

To simplify our construction, we will use additional operations on registers, 
instead of just WRITE and READ. We introduce different blocking comparisons 
between registers and values such as $==$, $<$, $\le$, $\neq$, register assignments such as 
$r := r'$, and increments by one denoted as $r$++. A more detailed description of 
these instructions is given in [3]. 


The Initializer. The pivot semantics define an exponential number of initial 
states: one per possible update sequence. The register machine instead guesses an 
update sequence at the start of the execution and stores it in the rank registers. 
This part of the register machine is the rank initializer (shown in Figure 2(a)). 
It uses the auxiliary register $\mathit{rk}_{\mathit{nxt}}$ to keep track of the next rank that is 
to be assigned. In a nondeterministic manner, the rank initializer chooses a 
so far unranked message and then it assigns the next rank to this message. If 
the variable of the message has no rank assigned yet, it updates the rank of 
the variable. Then it increases the $\mathit{rk}_{\mathit{nxt}}$ register and continues. After each rank 
assignment, the initializer can choose to stop the rank assignment. In that case, 
it initializes the register $\phi_P$ to 1 and finishes in the initial state of Proc. 


In addition to the rank initializer, we have the pointer initializer. It is responsible 
for resetting all pointers except the progress pointer to zero. The progress 
pointer is incremented by one instead. This initializer is not executed in the 
beginning of the simulation, but between epochs of the pivot abstraction. 


The simulator. The main part of this construction handles the simulation 
of the pivot abstraction. It contains $Q^{\mathit{Proc}}$ as well as several auxiliary states that 
are described in the following. It simulates each instruction of TSO(A). The skip 
instruction and the data instructions are carried out unchanged. A visualization 
of the remaining instructions is depicted in Figure 2. In case of a write instruction 
wr(x, d), we first compare the rank of the write message with the progress pointer. 
If they are equal, it means that the epoch is finished and the next process should 
start, therefore we jump to the first state of the pointer initializer. Otherwise, 
we set the last write pointer $\mathit{Lw}(x)$ to d. Now, we ensure that $\phi_L^{\mathit{max}}$ is at least as 
large as the rank of (x,d) and finally we update the local pointer $\phi_L(x)$ to be 
equal to $\phi_L^{\mathit{max}}$. For the memory fence instruction, it only needs to be ensured that 
the external pointer is at least as large as the maximum local pointer $\phi_L^{\mathit{max}}$. For a 
read instruction rd(x, d), if the last write to x was of value d, we can execute the 
read directly. Otherwise, after checking that the write can be performed by the 
current provider, we ensure that the external pointer is at least as large as both 
the rank of (x,d) and the local pointer of x. For the special case that $d = d_{\mathit{init}}$, 
there is an additional way in which the read can be performed: We can read $d_{\mathit{init}}$ 
from the memory if the process has neither already written to x nor observed a 
write that has higher or equal rank than the rank of x. This gives us the following 
theorem, proven in Appendix C of the full version [3]: 


Theorem 1. TSO(A)-P-Reach is polynomial time reducible to R(A)-Reach. 

Fig. 2: The rank initializer and the simulator for some instructions instr. 


6.2 Simulating Register Machines by TSO 


We will now show how to simulate an ADT register machine with a parameter- 
ized program running under TSO(A). The main idea is to save the information 
about the registers in the last pending write operations, while making sure that 
not a single write operation actually hits the memory. Thus, the simulator always 
reads the initial value or its own writes, never writes of other processes. 

The TSO program has a variable for each register, and two additional variables 
$x_s$ and $x_c$ that act as flags: $x_s$ indicates that the verifier should start working, 
while $x_c$ indicates that the verifier has successfully completed the verification. 
At the beginning of the execution, each process nondeterministically chooses to 
be either simulator, scheduler, or verifier. Each role will be described in the 
following. The complete construction is shown in Appendix C of [3]. 

The simulator uses the same states and transitions as R(A), but instead of
reading from and writing to registers, it uses the memory. If the simulator reaches
the target state $q_{target}$, it first checks the $x_s$ flag. If it is already set,
the simulator stops, never reaching the final state $q_{final}$. Otherwise, it waits
until it observes the flag $x_c$ to be set. It then enters the final state. The
scheduler's only responsibility is to signal the start of the verification process.
It does so by setting the flag $x_s$ at a nondeterministically chosen time during
the execution of the program. The verifier waits until it observes the flag $x_s$.
It then starts the verification process, which consists of checking each variable
that corresponds to a register. If all of them still contain their initial value,
the verification was successful. The verifier signals this to the simulator process
by setting the $x_c$ flag.

Any execution ending in $q_{final}$ must first perform a simulation of R(A) ending
in $q_{target}$; then a scheduler propagates the setting of flag $x_s$, and afterwards
a verifier executes. This ensures that the initial values are read by the verifier
after the register machine has been simulated and thus the shared memory is
unchanged. This means the simulator only accessed its write buffer and not
writes from other threads. It follows that $q_{target}$ is reachable by R(A) if and
only if $q_{final}$ is reachable by Proc under TSO(A). This gives us the following
result:


Theorem 2. R(A)-Reach is polynomial time reducible to TSO(A)-P-Reach. 


Theorem 1 and Theorem 2 give us a method of determining upper and 
lower bounds of the complexity of TSO(A)-P-Reach for different instantiations 
of ADT. Since we have reductions in both directions, we can conclude that 
TSO(A)-P-Reach is decidable if and only if R(A)-Reach is decidable. We know
that TSO(NoADT)-P-Reach is PSPACE-hard, where NoADT is the trivial ADT that
models plain TSO semantics [4]. We can therefore immediately derive a lower
bound for any ADT A: TSO(A)-P-Reach is PSPACE-hard.


7 Instantiations of ADTs 


In the following, we instantiate our framework to a number of ADTs in order to 
show its applicability. 


Theorem 3. TSO(CT)-P-Reach and TSO(wCT)-P-Reach are PSPACE-complete.


We know TSO(A)-P-Reach is PSPACE-hard for any ADT A, including CT
and wCT. Regarding the upper bound for CT, we can show that R(CT)-Reach
can be polynomially reduced to R(NoADT)-Reach. The idea is to show that there
is a bound on the counter values needed to find a witness for R(CT)-Reach. This
bound is polynomial in the number of possible states and register assignments
(i.e., it is at most exponential in the size of R(CT)). Assume a run that
contains a configuration c with a counter value that exceeds the bound. Then some
combination of state and register assignment is repeated in the run with different
counter values. We can use this to shorten the run such that the counter value in
c is reduced.

We can encode the counter value (up to this bound) in binary into
registers acting as bits. The number of additional registers is polynomial in the
size of R(CT). In order to simulate an inc operation on this binary encoding
using WRITE and READ, we only have to go through the bits starting at the least
significant bit and flip them until one is flipped from 0 to 1. The dec operation
works analogously. This only requires a polynomial state and transition overhead.

We know that R(NoADT)-Reach is in PSPACE [4]. It follows from the polynomial
reduction that R(CT)-Reach is in PSPACE. Applying Theorem 1 gives
us that TSO(CT)-P-Reach is in PSPACE. Since any wCT is a CT, it follows that
TSO(wCT)-P-Reach is in PSPACE as well. The proof is in [3].
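
The bit-flipping simulation of inc and dec described above can be sketched as follows. This is only an illustration under stated assumptions: the bit registers are modelled as a Python list stored least significant bit first, and the function names inc, dec and value are hypothetical rather than taken from the construction in [3].

```python
def inc(bits):
    """Increment a counter stored LSB-first in 0/1 'registers'.

    Each loop iteration corresponds to one READ of a bit followed by one WRITE.
    Returns False on overflow (the bits have then wrapped to zero); a run of
    the simulating machine would simply be blocked in that case.
    """
    for i in range(len(bits)):
        if bits[i] == 0:   # READ bit i
            bits[i] = 1    # WRITE: flip 0 -> 1 and stop
            return True
        bits[i] = 0        # WRITE: flip 1 -> 0 and carry to the next bit
    return False


def dec(bits):
    """Decrement analogously: flip bits until one is flipped from 1 to 0.

    Returns False if the counter was already zero (the bits are then all 1).
    """
    for i in range(len(bits)):
        if bits[i] == 1:
            bits[i] = 0
            return True
        bits[i] = 1
    return False


def value(bits):
    """Decode the LSB-first bit list back into an integer (for inspection only)."""
    return sum(b << i for i, b in enumerate(bits))
```

Each call touches at most len(bits) registers, which matches the polynomial overhead claimed above for a polynomial number of bit registers.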


Theorem 4. TSO(ST)-P-Reach is EXPTIME-complete.


For membership, we encode the registers of R(ST) in the states, which yields a 
finite state machine with access to a stack, i.e. a pushdown automaton. The con- 
struction has an exponential number of states. From [45], we have that checking 
the emptiness of a context-free language generated by a pushdown automaton is 
polynomial in terms of the size of the automaton. Combined, we get that state 
reachability of the constructed pushdown automaton is in EXPTIME. It follows
that R(ST)-Reach is in EXPTIME and, by Theorem 1, so is TSO(ST)-P-Reach.

To prove the lower bound, we can reduce the problem of checking the emptiness
of the intersection of a pushdown automaton with n finite-state automata [35] to
R(ST)-Reach. This problem is well known to be EXPTIME-complete. The idea
is to use the stack to simulate the pushdown automaton and n registers to keep
track of the states of the finite-state automata. We apply Theorem 2 and get that
TSO(ST)-P-Reach is EXPTIME-hard. The formal proof is in [3].


Theorem 5. TSO(PETRI)-P-Reach is EXPSPACE-complete. 


Proof. Petri net coverability is known to be EXPSPACE-complete [26]. We show
hardness by reducing coverability of a marking m to R(PETRI)-Reach. The idea
is to construct a register machine with a Petri net as ADT. This register machine
will have two states $q_{init}$ and $q_{final}$. For every transition t of the original
Petri net, we have $q_{init} \xrightarrow{t} q_{init}$ as a transition of the register
machine (we simply simulate the original Petri net). Furthermore, the register
machine has a transition from $q_{init}$ to $q_{final}$ that is enabled only when m is
covered. Thus, the state $q_{final}$ can be reached iff m can be covered.

We reduce reachability of R(PETRI) to Petri net coverability. We construct
the Petri net by taking the ADT PETRI and adding a place $p_q$ for every state
q and a place $p_{reg,d}$ for every register $reg \in Regs$ and register value
$d \in Dom$. The idea is that a marking with a token in $p_q$ and one in $p_{reg,d}$
but none in $p_{reg,d'}$ for $d' \neq d$ corresponds to a configuration of R(PETRI)
with state q and reg = d. The value of PETRI is given by the remainder of the marking.

We simulate any transition $q \xrightarrow{instr} q'$ with a Petri net transition t
that takes one token from $p_q$ and puts one in $p_{q'}$. If $instr \in Ops$, then instr
is a Petri net transition. We simply add the same input and output arcs to t. To
simulate a write, we add a new transition $t_{d'}$ for every $d' \in Dom$ with an arc
to $p_{reg,d}$ and an arc from $p_{reg,d'}$. The initial marking is consistent with the
initial value of PETRI and has one token in $p_{q_{init}}$. A state q is reachable if a
marking with one token in $p_q$ is coverable.

Higher Order ADTs. Let the M(A)-Reach problem be the restriction of R(A)-Reach
to machines with no registers. The M(A)-Reach problem has been studied for many
ADTs, such as higher order counter and higher order stack variations [34,25].


Theorem 6. 


- TSO(n-ST)-P-Reach is (n-1)-EXPTIME-hard and in n-EXPTIME.
- TSO(n-wCT)-P-Reach is (n-2)-EXPTIME-hard and in (n-1)-EXPTIME.
- TSO(n-CT)-P-Reach is (n-2)-EXPSPACE-hard and in (n-1)-EXPSPACE.


Proof. M(n-ST)-Reach has been shown to be (n-1)-EXPTIME-complete [25].
We know M(n-wCT)-Reach is (n-2)-EXPTIME-complete and M(n-CT)-Reach
is (n-2)-EXPSPACE-complete [34]. Since the reduction from M(A)-Reach to
R(A)-Reach is trivial, any hardness result can be applied to TSO(A)-P-Reach 
immediately using Theorem 2. In order to reduce R(A)-Reach to M(A)-Reach, 
we encode register assignments into the state which results in an exponential 
state explosion. Then we apply Theorem 1 to obtain our upper bound. 
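
The encoding of register assignments into the state, used in the proof above, can be sketched as a product construction. The following is a minimal illustration, not the construction from [3]: the function name encode_registers, the tuple-based instruction format ("write", reg, d) / ("read", reg, d), and the use of None as a silent label are all assumptions made for this example.

```python
from itertools import product


def encode_registers(states, regs, dom, transitions):
    """Fold register contents into the control state.

    transitions: iterable of (q, instr, q2), where instr is ("write", reg, d),
    ("read", reg, d), or any other tuple, which is treated as an ADT operation
    and kept as the transition label.
    Returns (new_states, new_transitions) of a machine without registers.
    """
    # One tuple of values per register assignment: |dom| ** |regs| of them.
    assignments = list(product(dom, repeat=len(regs)))
    index = {reg: i for i, reg in enumerate(regs)}

    new_states = [(q, a) for q in states for a in assignments]
    new_transitions = []
    for q, instr, q2 in transitions:
        for a in assignments:
            if instr[0] == "write":
                _, reg, d = instr
                b = list(a)
                b[index[reg]] = d  # the write changes one register component
                new_transitions.append(((q, a), None, (q2, tuple(b))))
            elif instr[0] == "read":
                _, reg, d = instr
                if a[index[reg]] == d:  # the read is enabled only on a match
                    new_transitions.append(((q, a), None, (q2, a)))
            else:
                # ADT operation: registers are untouched, the label is kept.
                new_transitions.append(((q, a), instr, (q2, a)))
    return new_states, new_transitions
```

The number of new states is |states| times |dom| to the power |regs|, which is the exponential blow-up mentioned in the proof.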


Theorem 7. TSO(n-OMST)-P-Reach is 2-ETIME-complete.


Proof. We know that M(n-OMST)-Reach is 2-ETIME-complete [12] and we can
apply Theorem 2 to get 2-ETIME-hardness. According to Theorem 4.6 in [11],
M(n-OMST)-Reach is in $O(|M(A)|^{2^{dn}})$ for some constant $d \in \mathbb{N}$. We apply
the exponential size reduction to R(n-OMST)-Reach and Theorem 1 and get that
TSO(n-OMST)-P-Reach is in $O((2^{|P|})^{2^{dn}}) = O(2^{|P| \cdot 2^{dn}})$ and thus it is also in
$O(2^{2^{|P|} \cdot 2^{dn}}) = O(2^{2^{|P|+dn}})$. Thus, TSO(n-OMST)-P-Reach is in 2-ETIME.


We study well structured ADTs [29,9] as defined in [3]:


Theorem 8. If ADT A is well structured, then TSO(A)-P-Reach is decidable.


A register machine for a well structured ADT A is equivalent to the composition 
of a well structured transition system (WSTS) modeling A and a finite transition 
system (and thus a WSTS) that models states and registers. According to [9], the 
composition is again a WSTS and reachability is decidable. The above theorem 
is then an immediate corollary of Theorem 1. 


8 Conclusions and Future Work 


In this paper, we have taken the first step towards studying the complexity of
parameterized verification under weak memory models when the processes manipulate
unbounded data domains. Concretely, we have presented complexity results for 
parameterized concurrent programs running on the classical TSO memory model 
when the processes operate on an abstract data type. We reduce the problem to 
reachability for register machines enriched with the given abstract data type. 
State reachability for finite automata with ADT has been extensively studied
for many ADTs [34,25]. We have shown in Theorem 6 that we can apply
our framework to existing complexity results of this problem. This provides
us with decidability and complexity results for the corresponding instances of 
TSO(A)-P-Reach. However, due to the exponential number of register assign- 
ments, the upper bound is exponentially larger than the lower bound. We aim 
to study these cases further and determine more refined parametric bounds. 

A direction for future work is considering other memory models, such as the 
partial store ordering semantics, the release-acquire semantics, and the ARM 
semantics. It is also interesting to re-consider the problem under the assumption 
of having distinguished processes (so-called leader processes). Adding leaders is 
known to make the parameterized verification problem harder. The complex- 
ity/decidability of parameterized verification under TSO with a single leader is 
open, even when the processes are finite-state. 


References 


1. Parosh Aziz Abdulla. Regular model checking. STTT, 14(2):109-118, 2012. 

2. Parosh Aziz Abdulla, Mohamed Faouzi Atig, Ahmed Bouajjani, and Tuan Phong 
Ngo. A load-buffer semantics for total store ordering. LMCS, 14(1), 2018. 

3. Parosh Aziz Abdulla, Mohamed Faouzi Atig, Florian Furbach, Adwait Godbole,
Yacoub G. Hendi, Shankaranarayanan Krishna, and Stephan Spengler. Parameterized
verification under TSO with data types. arXiv e-prints, 2023. arXiv:2302.02163.

4. Parosh Aziz Abdulla, Mohamed Faouzi Atig, and Rojin Rezvan. Parameterized
verification under TSO is PSPACE-complete. Proc. ACM Program. Lang., 4(POPL),
2019.

5. Parosh Aziz Abdulla, Yu-Fang Chen, Giorgio Delzanno, Frédéric Haziza, Chih- 
Duo Hong, and Ahmed Rezine. Constrained monotonic abstraction: A CEGAR 
for parameterized verification. In CONCUR 2010, pages 86-101, 2010. 

6. Parosh Aziz Abdulla and Giorgio Delzanno. Parameterized verification. STTT, 
18(5):469-473, 2016. 

7. Parosh Aziz Abdulla, Frédéric Haziza, and Lukas Holík. Parameterized verification 
through view abstraction. STTT, 18(5):495-516, 2016. 

8. Parosh Aziz Abdulla, A. Prasad Sistla, and Muralidhar Talupur. Model checking 
parameterized systems. In Handbook of Model Checking, pages 685-725. Springer, 
2018. 

9. Parosh Aziz Abdulla, Kārlis Cerans, Bengt Jonsson, and Yih-Kuen Tsay. Al- 
gorithmic analysis of programs with well quasi-ordered domains. Inf. Comput., 
160:109-127, 2000. 

10. Krzysztof R. Apt and Dexter Kozen. Limits for automatic verification of
finite-state concurrent systems. Inf. Process. Lett., 22(6):307-309, 1986.

11. Mohamed Faouzi Atig. Model-Checking of Ordered Multi-Pushdown Automata.
LMCS, Volume 8, Issue 3, 2012.

12. Mohamed Faouzi Atig, Benedikt Bollig, and Peter Habermehl. Emptiness of
multi-pushdown automata is 2ETIME-complete. In Developments in Language Theory,
pages 121-133. Springer, 2008.

13. Roderick Bloem, Swen Jacobs, Ayrat Khalimov, Igor Konnov, Sasha Rubin, Hel- 
mut Veith, and Josef Widder. Decidability in parameterized verification. SIGACT 
News, 47(2):53-64, 2016. 


14. Bernard Boigelot, Axel Legay, and Pierre Wolper. Iterating transducers in the
large (extended abstract). In CAV, volume 2725 of LNCS, pages 223-235. Springer,
2003.

15. Ahmed Bouajjani, Egor Derevenetc, and Roland Meyer. Checking and enforcing
robustness against TSO. In ETAPS, pages 533-553, 2013.

16. Ahmed Bouajjani, Peter Habermehl, Adam Rogalewicz, and Tomás Vojnar.
Abstract regular (tree) model checking. STTT, 14(2):167-191, 2012.

17. Sebastian Burckhardt. Principles of eventual consistency. FTPL, 1(1-2):1-150,
2014.

18. Thierry Cachat and Igor Walukiewicz. The complexity of games on higher order
pushdown automata. CoRR, abs/0705.0262, 2007.

19. Sylvain Conchon, David Declerck, and Fatiha Zaidi. Parameterized model checking
on the TSO weak memory model. J. Autom. Reason., 64(7):1307-1330, 2020.

20. Giorgio Delzanno, Arnaud Sangnier, and Gianluigi Zavattaro. Parameterized
verification of ad hoc networks. In CONCUR, pages 313-327, 2010.

21. Marco Elver and Vijay Nagarajan. TSO-CC: consistency directed cache coherence
for TSO. In HPCA, pages 165-176. IEEE, 2014.

22. E. Allen Emerson, John Havlicek, and Richard J. Trefler. Virtual symmetry
reduction. In LICS, pages 121-131, 2000.

23. E. Allen Emerson and Vineet Kahlon. Exact and efficient verification of
parameterized cache coherence protocols. In CHARME, volume 2860 of LNCS, pages
247-262. Springer, 2003.

24. E. Allen Emerson and Vineet Kahlon. Parameterized model checking of ring-based
message passing systems. In CSL, volume 3210 of LNCS, pages 325-339. Springer,
2004.

25. Joost Engelfriet. Iterated stack automata and complexity classes. Inf. Comput.,
95(1):21-75, 1991.

26. Javier Esparza. Decidability and complexity of petri net problems - an
introduction. LNCS, 1491, 2000.

27. Javier Esparza, Alain Finkel, and Richard Mayr. On the verification of broadcast
protocols. In LICS, pages 352-359. IEEE Computer Society, 1999.

28. Javier Esparza, Pierre Ganty, and Rupak Majumdar. Parameterized verification
of asynchronous shared-memory systems. J. ACM, 63(1):10:1-10:48, 2016.

29. A. Finkel and Ph. Schnoebelen. Well-structured transition systems everywhere!
Theoretical Computer Science, 256(1):63-92, 2001.

30. Marie Fortin, Anca Muscholl, and Igor Walukiewicz. Model-checking linear-time
properties of parametrized asynchronous shared-memory pushdown systems. In
CAV, pages 155-175, 2017.

31. Pierre Ganty and Rupak Majumdar. Algorithmic verification of asynchronous
programs. ACM Trans. Program. Lang. Syst., 34(1):6:1-6:48, 2012.

32. Steven M. German and A. Prasad Sistla. Reasoning about systems with many
processes. J. ACM, 39(3):675-735, 1992.

33. Matthew Hague. Parameterised pushdown systems with non-atomic writes. In
FSTTCS, pages 457-468, 2011.

34. Alexander Heußner and Alexander Kartzow. Reachability in higher-order-counters.
In MFCS, pages 528-539. Springer, 2013.

35. Alexander Heußner, Jérôme Leroux, Anca Muscholl, and Grégoire Sutre.
Reachability analysis of communicating pushdown systems. In FOSSACS, pages 267-281.
Springer, 2010.

36. Vineet Kahlon. Parameterization as abstraction: A tractable approach to the
dataflow analysis of concurrent programs. In LICS, pages 181-192, 2008.


37. Alexander Kaiser, Daniel Kroening, and Thomas Wahl. Dynamic cutoff detection 
in parameterized concurrent programs. In CAV, volume 6174 of LNCS, pages 
645-659. Springer, 2010. 

38. Yonit Kesten, Oded Maler, Monica Marcus, Amir Pnueli, and Elad Shahar. Sym- 
bolic model checking with rich assertional languages. Theor. Comput. Sci., 256(1- 
2):93-112, 2001. 

39. Shankara Narayanan Krishna, Adwait Godbole, Roland Meyer, and Soham
Chakraborty. Parameterized verification under release acquire is PSPACE-complete.
In PODC, pages 482-492. ACM, 2022.

40. Salvatore La Torre, Anca Muscholl, and Igor Walukiewicz. Safety of parametrized 
asynchronous shared-memory systems is almost always decidable. In CONCUR, 
pages 72-84, 2015. 

41. Ori Lahav, Nick Giannarakis, and Viktor Vafeiadis. Taming release-acquire con- 
sistency. In SIGPLAN-SIGACT, pages 649-662. ACM, 2016. 

42. Anca Muscholl, Helmut Seidl, and Igor Walukiewicz. Reachability for dynamic
parametric processes. In VMCAI, pages 424-441, 2017.

43. Kedar S. Namjoshi and Richard J. Trefler. Parameterized compositional model 
checking. In ETAPS, volume 9636 of LNCS, pages 589-606. Springer, 2016. 

44. J. L. Peterson. Petri Net Theory and the Modeling of Systems. Prentice Hall PTR, 
1981. 

45. Rajeev Alur, Ahmed Bouajjani, and Javier Esparza. Handbook of Model Checking,
chapter Model Checking Procedural Programs, pages 547-569. Springer, 2018.

46. Alberto Ros and Stefanos Kaxiras. Racer: TSO consistency via race detection. In 
MICRO, pages 33:1-33:13. IEEE Computer Society, 2016. 

47. Susmit Sarkar, Peter Sewell, Jade Alglave, Luc Maranget, and Derek Williams. 
Understanding POWER multiprocessors. In ACM SIGPLAN, PLDI, pages 175- 
186. ACM, 2011. 

48. Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and
Magnus O. Myreen. x86-TSO: a rigorous and usable programmer's model for x86
multiprocessors. Commun. ACM, 53(7):89-97, 2010.


Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium 
or format, as long as you give appropriate credit to the original author(s) and the 
source, provide a link to the Creative Commons license and indicate if changes were 
made. 

The images or other third party material in this chapter are included in the chapter’s 
Creative Commons license, unless indicated otherwise in a credit line to the material. If 
material is not included in the chapter’s Creative Commons license and your intended 
use is not permitted by statutory regulation or exceeds the permitted use, you will 
need to obtain permission directly from the copyright holder. 




Verifying Learning-Based 
Robotic Navigation Systems 


Guy Amir¹, Davide Corsi², Raz Yerushalmi¹,³, Luca Marzari²,
David Harel³, Alessandro Farinelli², and Guy Katz¹


1 The Hebrew University of Jerusalem, Jerusalem, Israel 
{guyam, guykatz}@cs.huji.ac.il 
2 University of Verona, Verona, Italy
{davide.corsi,luca.marzari,alessandro.farinelli}@univr.it 
3 The Weizmann Institute of Science, Rehovot, Israel 
{raz.yerushalmi,david.harel}@weizmann.ac.il 


Abstract. Deep reinforcement learning (DRL) has become a dominant 
deep-learning paradigm for tasks where complex policies are learned 
within reactive systems. Unfortunately, these policies are known to be 
susceptible to bugs. Despite significant progress in DNN verification, 
there has been little work demonstrating the use of modern verification 
tools on real-world, DRL-controlled systems. In this case study, we at- 
tempt to begin bridging this gap, and focus on the important task of 
mapless robotic navigation — a classic robotics problem, in which a 
robot, usually controlled by a DRL agent, needs to efficiently and safely 
navigate through an unknown arena towards a target. We demonstrate 
how modern verification engines can be used for effective model selection, 
i.e., selecting the best available policy for the robot in question from a 
pool of candidate policies. Specifically, we use verification to detect and 
rule out policies that may demonstrate suboptimal behavior, such as col- 
lisions and infinite loops. We also apply verification to identify models 
with overly conservative behavior, thus allowing users to choose supe- 
rior policies, which might be better at finding shorter paths to a target. 
To validate our work, we conducted extensive experiments on an ac- 
tual robot, and confirmed that the suboptimal policies detected by our 
method were indeed flawed. We also demonstrate the superiority of our 
verification-driven approach over state-of-the-art gradient attacks. Our
work is the first to establish the usefulness of DNN verification in iden- 
tifying and filtering out suboptimal DRL policies in real-world robots, 
and we believe that the methods presented here are applicable to a wide 
range of systems that incorporate deep-learning-based agents. 


1 Introduction 


In recent years, deep neural networks (DNNs) have become extremely popular,
due to achieving state-of-the-art results in a variety of fields, such as natural
language processing [16], image recognition [51], autonomous driving [11], and
more. The immense success of these DNN models is owed in part to their ability 
to train on a fixed set of training samples drawn from some distribution, and then 
generalize, i.e., correctly handle inputs that they had not encountered previously. 
Notably, deep reinforcement learning (DRL) [37] has recently become a dominant 
paradigm for training DNNs that implement control policies for complex systems 
that operate within rich environments. One domain in which DRL controllers 
have been especially successful is robotics, and specifically — robotic navigation, 
i.e., the complex task of efficiently navigating a robot through an arena, in order 
to safely reach a target [63,68]. 


Unfortunately, despite the immense success of DNNs, they have been shown 
to suffer from various safety issues [31,57]. For example, small perturbations 
to their inputs, which are either intentional or the result of noise, may cause 
DNNs to react in unexpected ways [45]. These inherent weaknesses, and others, 
are observed in almost every kind of neural network, and indicate a need for 
techniques that can supply formal guarantees regarding the safety of the DNN 
in question. These weaknesses have also been observed in DRL systems [6,21,34], 
showing that even state-of-the-art DRL models may err miserably. 


To mitigate such safety issues, the verification community has recently de- 
veloped a plethora of techniques and tools [8, 10, 19,24, 28, 29, 31,35, 39, 40, 64, 66] 
for formally verifying that a DNN model is safe to deploy. Given a DNN, these 
methods usually check whether the DNN: (i) behaves according to a prescribed 
requirement for all possible inputs of interest; or (ii) violates the requirement, 
in which case the verification tool also provides a counterexample. 


To date, despite the abundance of both DRL systems and DNN verification 
techniques, little work has been published on demonstrating the applicability 
and usefulness of verification techniques to real-world DRL systems. In this case 
study, we showcase the capabilities of DNN verification tools for analyzing DRL- 
based systems in the robotics domain — specifically, robotic navigation systems. 
To the best of our knowledge, this is the first attempt to demonstrate how
off-the-shelf verification engines can be used to identify both unsafe and
suboptimal DRL robotic controllers that cannot otherwise be detected using existing,
incomplete methods. Our approach leverages existing DNN verifiers that can
reason about single and multiple invocations of DRL controllers, and this allows 
us to conduct a verification-based model selection process — through which we 
filter out models that could render the system unsafe. 


In addition to model selection, we demonstrate how verification methods al- 
low gaining better insights into the DRL training process, by comparing the 
outcomes of different training methods and assessing how the models improve 
over additional training iterations. We also compare our approach to gradient- 
based methods, and demonstrate the advantages of verification-based tools in 
this setting. We regard this as another step towards increasing the reliability 
and safety of DRL systems, which is one of the key challenges in modern ma- 
chine learning [27]; and also as a step toward a more wholesome integration of 
verification techniques into the DRL development cycle.

In order to validate our experiments, we conducted an extensive evaluation
on a real-world, physical robot. Our results demonstrate that policies classified 
as suboptimal by our approach indeed exhibited unwanted behavior. This eval- 
uation highlights the practical nature of our work; and is summarized in a short 
video clip [4], which we strongly encourage the reader to watch. In addition, our 
code and benchmarks are available online [3]. 

The rest of the paper is organized as follows. Section 2 contains background 
on DNNs, DRLs, and robotic controlling systems. In Section 3 we present our 
DRL robotic controller case study, and then elaborate on the various properties 
that we considered in Section 4. In Section 5 we present our experimental results, 
and use them to compare our approach with competing methods. Related work 
appears in Section 6, and we conclude in Section 7. 


2 Background 


Deep Neural Networks. Deep neural networks (DNNs) [25] are computational
directed graphs consisting of multiple layers. By assigning values to the
first layer of the graph and propagating them through the subsequent layers, 
the network computes either a label prediction (for a classification DNN) or a 
value (for a regression DNN), which is returned to the user. The values com- 
puted in each layer depend on values computed in previous layers, and also on 
the current layer’s type. Common layer types include the weighted sum layer, in 
which each neuron is an affine transformation of the neurons from the preceding 
layer; as well as the popular rectified linear unit (ReLU) layer, where each node 
y computes the value y = ReLU(x) = max(0, x), based on a single node x from 
the preceding layer to which it is connected. The DRL systems that are the
subject matter of this case study consist solely of weighted sum and ReLU layers,
although the techniques mentioned are suitable for DNNs with additional layer
types, as we discuss later.

Fig. 1: A toy DNN (layers, left to right: input, weighted sum, ReLU, output).

Fig. 1 depicts a small example of a DNN. For input $V_1 = [2, 3]^T$, the second
(weighted sum) layer computes the values $V_2 = [20, -7]^T$. In the third layer,
the ReLU functions are applied, and the result is $V_3 = [20, 0]^T$. Finally, the
network's single output is computed as a weighted sum: $V_4 = [40]$.
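
The layer-by-layer evaluation just described can be sketched in a few lines of Python. The weights W1 and W2 below are hypothetical: they were chosen only so that the input [2, 3] reproduces the intermediate values quoted in the text, and they are not claimed to be the weights drawn in Fig. 1. Biases are assumed to be zero.

```python
def weighted_sum(weights, v):
    """Weighted sum layer: one output neuron per row of `weights`."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def relu(v):
    """ReLU layer: applies max(0, x) to each neuron independently."""
    return [max(0, x) for x in v]


# Hypothetical weights, picked to match the values quoted in the text.
W1 = [[4, 4], [-5, 1]]  # input layer -> weighted sum layer
W2 = [[2, 3]]           # ReLU layer  -> output layer


def forward(v1):
    """Return the values of layers 2, 3 and 4 for the input vector v1."""
    v2 = weighted_sum(W1, v1)
    v3 = relu(v2)
    v4 = weighted_sum(W2, v3)
    return v2, v3, v4
```

With these assumed weights, forward([2, 3]) returns ([20, -7], [20, 0], [40]), in agreement with the values given for the toy network.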


Deep Reinforcement Learning. Deep reinforcement learning (DRL) [37] is a 
particular paradigm and setting for training DNNs. In DRL, an agent is trained 
to learn a policy $\pi$, which maps each possible environment state s (i.e., the
current observation of the agent) to an action a. The policy can have different
interpretations among various learning algorithms. For example, in some cases,
$\pi$ represents a probability distribution over the action space, while in others it
encodes a function that estimates a desirability score over all the future actions
from a state s.

During training, at each discrete time-step $t \in \{0, 1, 2, \ldots\}$, a reward $r_t$ is
presented to the agent, based on the action $a_t$ it performed at time-step t.
Different DRL training algorithms leverage the reward in different ways, in order to
optimize the DNN-agent's parameters during training. The general DNN
architecture described above also characterizes DRL-trained DNNs; the uniqueness
of the DRL paradigm lies in the training process, which is aimed at generating
a DNN that computes a mapping $\pi$ that maximizes the expected cumulative
discounted reward $R_t = \mathbb{E}\big[\sum_{t} \gamma^{t} \cdot r_t\big]$. The discount factor, $\gamma \in [0, 1]$, is a
hyperparameter that controls the influence that past decisions have on the total
expected reward.
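
As a small worked illustration of the cumulative discounted reward, the sketch below computes the discounted sum for a single finite episode. It is an assumption-laden simplification: the expectation in the definition is taken over many trajectories, whereas the hypothetical helper discounted_return handles just one recorded list of rewards.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one finite episode (t starts at 0)."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total
```

For example, three rewards of 1 with a discount factor of 0.5 give 1 + 0.5 + 0.25 = 1.75, while a discount factor of 1 simply sums the rewards.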

DRL training algorithms are typically divided into three categories [55]: 


1. Value-Based Algorithms. These algorithms attempt to learn a value func- 
tion (called the Q-function) that assigns a value to each (state,action) pair. 
This iterative process relies on the Bellman equation [44] to update the
function: $Q^{\pi}(s_t, a_t) = r_t + \gamma \max_{a_{t+1}} Q^{\pi}(s_{t+1}, a_{t+1})$. Double Deep Q-Network
(DDQN) is an optimized implementation of this algorithm [60]. 

2. Policy-Gradient Algorithms. This class contains algorithms that attempt 
to directly learn the optimal policy, instead of assessing the value func- 
tion. The algorithms in this class are typically based on the policy gradi- 
ent theorem [56]. A common implementation is the Reinforce algorithm [67], 
which aims to directly optimize the following objective function, over the
parameters $\theta$ of the DNN, through a gradient ascent process:
$\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}\big[\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \cdot r_t\big]$. For additional details, see [67].

3. Actor-Critic Algorithms. This family of hybrid algorithms combines the 
two previous approaches. The key idea is to use two different neural networks: 
a critic, which learns the value function from the data, and an actor, which 
iteratively improves the policy by maximizing the value function learned by 
the critic. A state-of-the-art implementation of this approach is the Proximal 
Policy Optimization (PPO) algorithm [50]. 
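
To make the value-based update from the first category concrete, the following is a minimal tabular sketch of one Bellman backup. It is not DDQN, which approximates Q with a neural network; the table Q, the function name q_update, and the learning rate alpha are assumptions introduced only for this illustration.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, gamma=0.9, alpha=0.5):
    """Move Q[(s, a)] towards the target r + gamma * max_a' Q[(s_next, a')].

    alpha is an assumed learning rate; with alpha = 1 the entry is set
    exactly to the right-hand side of the Bellman equation.
    """
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


# Usage: unseen (state, action) pairs default to a value of 0.0.
Q = defaultdict(float)
Q[("s1", "left")] = 2.0
new_value = q_update(Q, "s0", "right", 1.0, "s1", ["left", "right"],
                     gamma=0.5, alpha=1.0)
```

In the usage lines, the target is 1.0 + 0.5 * 2.0, so with alpha = 1 the entry for ("s0", "right") becomes 2.0.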


All of these approaches are commonly used in modern DRL; and each has its 
advantages and disadvantages. For example, the value-based methods typically 
require only small sets of examples to learn from, but are unable to learn policies 
for continuous spaces of (state,action) pairs. In contrast, the policy-gradient 
methods can learn continuous policies, but suffer from a low sample efficiency 
and large memory requirements. Actor-Critic algorithms attempt to combine 
the benefits of value-based and policy-gradient methods, but suffer from high 
instability, particularly in the early stages of training, when the value function 
learned by the critic is unreliable. 


DNN Verification and DRL Verification. A DNN verification algorithm 
receives as input [31]: (i) a trained DNN N; (ii) a precondition P on the DNN’s 
inputs, which limits their possible assignments to inputs of interest; and (iii) a 
postcondition Q on N’s output, which usually encodes the negation of the be- 
havior we would like N to exhibit on inputs that satisfy P. The verification 
algorithm then searches for a concrete input $x_0$ that satisfies $P(x_0) \wedge Q(N(x_0))$,
and returns one of the following outputs: (i) SAT, along with a concrete input
$x_0$ that satisfies the given constraints; or (ii) UNSAT, indicating that no such $x_0$
exists. When Q encodes the negation of the required property, a SAT result
indicates that the property is violated (and the returned input $x_0$ triggers a bug),
while an UNSAT result indicates that the property holds.

For example, suppose we wish to verify that the DNN in Fig. 1 always outputs
a value strictly smaller than 7; i.e., that for any input $x = (v_1^1, v_1^2)$,
it holds that $N(x) = v_4^1 < 7$. This is encoded as a verification query by
choosing a precondition that does not restrict the input, i.e., P = (true), and
by setting $Q = (v_4^1 \geq 7)$, which is the negation of our desired property.
For this verification query, a sound verifier will return SAT, alongside a
feasible counterexample such as $x = (0, 2)$, which produces
$v_4^1 = 22 \geq 7$. Hence, the property does not hold for this DNN.
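The structure of such a query can be illustrated with a small falsification
sketch. The two-input ReLU network below, its weights, and all helper names are
hypothetical (the weights were merely picked so that the input (0, 2) yields 22;
the actual weights of Fig. 1 may differ). Note that a grid search is not a sound
verifier: it may report SAT together with a witness, but it can never establish
UNSAT.

```python
from itertools import product


def relu(v):
    return max(0.0, v)


def toy_network(x):
    """Hypothetical 2-2-1 ReLU network; weights chosen only for illustration."""
    a, b = x
    h1 = relu(2.0 * a + 5.0 * b)
    h2 = relu(-3.0 * a + 1.0 * b)
    return 2.0 * h1 + 1.0 * h2


def precondition(x):
    # P = true: the input is unrestricted.
    return True


def postcondition(y):
    # Q encodes the negation of the desired property "output < 7".
    return y >= 7.0


def search_counterexample(network, P, Q, grid):
    """Return ("SAT", x) if some grid point satisfies P(x) and Q(N(x)).

    Otherwise return ("UNKNOWN", None): sampling cannot prove UNSAT.
    """
    for x in product(grid, repeat=2):
        if P(x) and Q(network(x)):
            return "SAT", x
    return "UNKNOWN", None


status, witness = search_counterexample(
    toy_network, precondition, postcondition, grid=[0.0, 1.0, 2.0]
)
```

A complete verifier differs from this sketch precisely in the second branch: it
either finds a witness or proves that none exists.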

To date, the DNN verification community has focused primarily on DNNs 
used for a single, non-reactive, invocation [24,28,31,40,64]. Some work has been 
carried out on verifying DRL networks, which pose greater challenges: beyond 
the general scalability challenges of DNN verification, in DRL verification we 
must also take into account that agents typically interact with a reactive envi- 
ronment [6,9,15,21,30]. In particular, these agents are implemented with neural 
networks that are invoked multiple times, and the inputs of each invocation are 
usually affected by the outputs of the previous invocations. This fact
aggravates the scalability limitations (because multiple invocations must be encoded
in each query), and also makes the task of defining P and Q significantly more 
complex [6]. 


3 Case Study: Robotic Mapless Navigation 


Robotis Turtlebot 3. In our case study, we focus on the Robotis Turtlebot 3 
robot (Turtlebot, for short), depicted in Fig. 2. Given its relatively low cost and 
efficient sensor configuration, this robot is widely used in robotics research [7,46]. 
In particular, this robotic platform has the actuators required for moving and 
turning, as well as multiple lidar sensors for detecting obstacles. These sensors 
use laser beams to approximate the distance to the nearest object in their direc- 
tion [65]. In our experiments, we used a configuration with seven lidar sensors, 
each with a maximal range of one meter. Each pair of adjacent sensors is 30° apart,
thus allowing coverage of 180°. The images in Fig. 3 depict a simulation of the 
Turtlebot navigating through an arena, and highlight the lidar beams. See the 
full version of this paper [5] for additional details. 


The Mapless Navigation Problem. Robotic navigation is the task of navi- 
gating a robot (in our case, the Turtlebot) through an arena. The robot’s goal 
is to reach a target destination while adhering to predefined restrictions; e.g., 
selecting as short a path as possible, avoiding obstacles, or optimizing energy 
consumption. In recent years, robotic navigation tasks have received a great deal 
of attention [63,68], primarily due to their applicability to autonomous vehicles. 

Fig. 2: The Robotis Turtlebot 3 platform, navigating in an arena. The image on
the left depicts a static robot, and the image on the right depicts the robot 
moving towards the destination (the yellow square), while avoiding two wooden 
obstacles in its route. 


We study here the popular mapless variant of the robotic navigation problem, 
where the robot can rely only on local observations (i.e., its sensors), without 
any information about the arena’s structure or additional data from external 
sources. In this setting, which has been studied extensively [58], the robot has 
access to the relative location of the target, but does not have a complete map of 
the arena. This makes mapless navigation a partially observable problem, and 
among the most challenging tasks to solve in the robotics domain [13, 58, 70]. 


DRL-Controlled Mapless Navigation. State-of-the-art solutions to map- 
less navigation suggest training a DRL policy to control the robot. Such DRL- 
based solutions have obtained outstanding results from a performance point of 
view [47]. For example, recent work by Marchesini et al. [43] has demonstrated 
how DRL-based agents can be applied to control the Turtlebot in a mapless 
navigation setting, by training a DNN with a simple architecture, including two 
hidden layers. Following this recent work, in our case study we used the following 
topology for DRL policies: 


— An input layer with nine neurons. These include seven neurons representing 
the Turtlebot’s lidar readings. The additional, non-lidar inputs include one 
neuron representing the relative angle between the robot and the target, and 
one neuron representing the robot’s distance from the target. A scheme of 
the inputs appears in Fig. 4a. 

— Two subsequent fully-connected layers, each consisting of 16 neurons, and 
followed by a ReLU activation layer. 

— An output layer with three neurons, each corresponding to a different (dis- 
crete) action that the agent can choose to execute in the following step: move 
FORWARD, turn LEFT, or turn RIGHT.¹
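The topology listed above can be sketched as a plain forward pass. The weights
below are random placeholders rather than a trained policy, and the ordering of
the inputs (seven lidar readings, then angle, then distance), the ordering of
the outputs, and the argmax action selection are assumptions made for
illustration only.

```python
import random

ACTIONS = ("FORWARD", "LEFT", "RIGHT")   # assumed output ordering
LAYER_SIZES = (9, 16, 16, 3)             # inputs, two hidden layers, outputs


def init_policy(seed=0):
    """Random placeholder weights; a trained policy would supply real ones."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:]):
        weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, x):
    """Fully-connected layers, with ReLU applied after the hidden layers only."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def select_action(layers, lidar, angle, distance):
    """Deterministic choice: the action with the highest output value."""
    assert len(lidar) == 7
    scores = forward(layers, list(lidar) + [angle, distance])
    return ACTIONS[scores.index(max(scores))]


policy = init_policy()
action = select_action(policy, lidar=[1.0] * 7, angle=0.0, distance=0.5)
```

With 9·16 + 16·16 + 16·3 weights, such a network is small enough to be handled
by current DNN verifiers, which is part of what makes the case study tractable.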


¹ It has been shown that discrete controllers achieve excellent performance in robotic
navigation, often outperforming continuous controllers in a large variety of tasks [43].

Fig. 3: An example of a simulated Turtlebot entering a 2-step loop. The white
and red dashed lines represent the lidar beams (white indicates “clear”, and red 
indicates that an obstacle is detected). The yellow square represents the target 
position; and the blue arrows indicate rotation. In the first row, from left to 
right, the Turtlebot is stuck in an infinite loop, alternating between right and 
left turns. Given the deterministic nature of the system, the agent will continue 
to select these same actions, ad infinitum. In the second row, from left to right, 
we present an almost identical configuration, but with an obstacle located 30° 
to the robot’s left (circled in blue). The presence of the obstacle changes the 
input to the DNN, and allows the Turtlebot to avoid entering the infinite loop; 
instead, it successfully navigates to the target. 


While the aforementioned DRL topology has been shown to be efficient for 
robotic navigation tasks, finding the optimal training algorithm and reward func- 
tion is still an open problem. As part of our work, we trained multiple deter- 
ministic policies using the DRL algorithms presented in Section 2: DDQN [60], 
Reinforce [67], and PPO [50]. For the reward function, we used the following 
formulation: 

$$R_t = (d_{t-1} - d_t) \cdot \alpha - \beta,$$


where $d_t$ is the distance from the target at time-step $t$; $\alpha$ is a
normalization factor used to guarantee the stability of the gradient; and
$\beta$ is a fixed value, subtracted at each time-step, and resulting in a total
penalty proportional to the length of the path (by minimizing this penalty, the
agent is encouraged to reach the target quickly). In our evaluation, we
empirically selected $\alpha = 3$ and $\beta = 0.001$. Additionally, we added a
final reward of +1 when the robot reached the target, or -1 in case it collided
with an obstacle. For additional information regarding the training phase, see
the full version of this paper [5].
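The reward formulation can be written as a short function. This is a minimal
sketch: the function and argument names are hypothetical, and we assume here
that the terminal reward is added on top of the per-step term (the text does
not say whether it is added to it or replaces it).

```python
ALPHA = 3.0    # normalization factor reported in the text
BETA = 0.001   # fixed per-step penalty reported in the text


def step_reward(prev_dist, curr_dist, reached_target=False, collided=False):
    """R_t = (d_{t-1} - d_t) * alpha - beta, plus an assumed terminal term."""
    reward = (prev_dist - curr_dist) * ALPHA - BETA
    if reached_target:
        reward += 1.0   # final reward for reaching the target
    elif collided:
        reward -= 1.0   # final penalty for colliding with an obstacle
    return reward


# Moving 0.1 closer to the target: 0.1 * 3 - 0.001
approaching = step_reward(1.0, 0.9)
# An in-place turn leaves the distance unchanged and only pays the penalty.
turning = step_reward(0.5, 0.5)
```

Under this formulation an in-place turn always costs exactly the per-step
penalty, so a policy that rotates indefinitely accumulates a penalty
proportional to the episode length.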


DRL Training and Results. Using the training algorithms mentioned in Section 2,
we trained a collection of DRL agents to solve the Turtlebot mapless
navigation problem. We ran a stochastic training process, and thus obtained
varied agents; of these, we only kept those that achieved a success rate of at
least 96% during training. A total of 780 models were selected, consisting of
260 models for each of the three training algorithms. More specifically, for
each algorithm, all 260 models were generated from 52 random seeds. Each seed
gave rise to a family of 5 models, where the individual family members differ
in the number of training episodes used for training them. Fig. 4b shows the
trained models’ average success rate, for each algorithm used. We note that PPO
was generally the fastest to achieve high accuracy. However, all three training
algorithms successfully produced highly accurate agents.

Fig. 4: (a) The DRL controller used for the robot in our case study. The DRL
controller has nine input neurons: seven lidar sensor readings (blue), one
input indicating the relative angle (orange) between the robot and the target,
and one input indicating the distance (green) between the robot and the target.
(b) The average success rates of models trained by each of the three DRL
training algorithms, per training episode.


4 Using Verification for Model Selection 


All of our trained models achieved very high success rates, and so, at face value, 
there was no reason to favor one over the other. However, as we show next, a 
verification-based approach can expose multiple subtle differences between them. 
As our evaluation criteria, we define two properties of interest that are derived 
from the main goals of the robotic controller: (i) reaching the target; and (ii) 
avoiding collision with obstacles. Employing verification, we use these criteria to 
identify models that may fail to fulfill their goals, e.g., because they collide with 
various obstacles, are overly conservative, or may enter infinite loops without 
reaching the target. We now define the properties that we used, and the results 
of their verification are discussed in Section 5. Additional details regarding the 
precise encoding of our queries appear in the full version of this paper [5].


Collision Avoidance. Collision avoidance is a fundamental and ubiquitous 
safety property [14] for navigation agents. In the context of Turtlebot, our goal 
is to check whether there exists a setting in which the robot is facing an obstacle, 
and chooses to move forward — even though it has at least one other viable 
option, in the form of a direction in which it is not blocked. In such situations, 
it is clearly preferable to choose to turn LEFT or RIGHT instead of choosing to 
move FORWARD and collide. See Fig. 5 for an illustration. 

Fig. 5: Example of a single-step collision. The robot is not blocked on its right
and can avoid the obstacle by turning (panel A), but it still chooses to move 
forward — and collides (panel B). 


Given that turning LEFT or RIGHT produces an in-place rotation (i.e., the 
robot does not change its position), the only action that can cause a collision 
is FORWARD. In particular, a collision can happen when an obstacle is directly in 
front of the robot, or is slightly off to one side (just outside the front lidar’s field 
of detection). More formally, we consider the safety property “the robot does not 
collide at the next step”, with three different types of collisions: 


— FORWARD COLLISION: the robot detects an obstacle straight ahead, but nev- 
ertheless makes a step forward and collides with the obstacle. 

— LEFT COLLISION: the robot detects an obstacle ahead and slightly shifted 
to the left (using the lidar beam that is 30° to the left of the one point- 
ing straight ahead), but makes a single step forward and collides with the 
obstacle. The shape of the robot is such that in this setting, a collision is 
unavoidable. 

— RIGHT COLLISION: the robot detects an obstacle ahead and slightly shifted 
to the right, but makes a single step forward and collides with the obstacle. 


Recall that in mapless navigation, all observations are local — the robot has 
no sense of the global map, and can encounter any possible obstacle configu- 
ration (i.e., any possible sensor reading). Thus, in encoding these properties, 
we considered a single invocation of the DRL agent’s DNN, with the following 
constraints: 


1. All the sensors that are not in the direction of the obstacle receive a lidar 
input indicating that the robot can move either LEFT or RIGHT without risk 
of collision. This is encoded by lower-bounding these inputs. 

2. The single input in the direction of the obstacle is upper-bounded by a value 
matching the representation of an obstacle, close enough to the robot so that 
it will collide if it makes a move FORWARD. 

3. The input representing the distance to the target is lower-bounded, indicat- 
ing that the target has not yet been reached (encouraging the agent to make 
a move). 
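The three constraints above amount to a box (per-input lower and upper bounds)
over the nine DNN inputs; the postcondition Q then states that FORWARD is the
selected action. The sketch below only builds the precondition. All thresholds
are placeholder values, the inputs are assumed to be normalized to [0, 1], and
the left-to-right sensor indexing is an assumption; the exact values used in
our queries are derived from the robot's physical characteristics [5].

```python
INPUT_NAMES = [f"lidar_{i}" for i in range(7)] + ["target_angle", "target_distance"]
FRONT = 3  # index of the central lidar beam, assuming left-to-right ordering


def collision_precondition(obstacle_idx, clear_min=0.5, obstacle_max=0.2,
                           target_min=0.2):
    """Box constraints (lower, upper) per input for a single-step collision query.

    All thresholds are placeholders on inputs assumed normalized to [0, 1].
    """
    bounds = {name: (0.0, 1.0) for name in INPUT_NAMES}
    for i in range(7):
        if i == obstacle_idx:
            bounds[f"lidar_{i}"] = (0.0, obstacle_max)   # constraint 2
        else:
            bounds[f"lidar_{i}"] = (clear_min, 1.0)      # constraint 1
    bounds["target_distance"] = (target_min, 1.0)        # constraint 3
    return bounds


def satisfies(bounds, x):
    """Check whether a concrete input vector lies inside the box."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(x, bounds.values()))


queries = {
    "LEFT COLLISION": collision_precondition(FRONT - 1),
    "FORWARD COLLISION": collision_precondition(FRONT),
    "RIGHT COLLISION": collision_precondition(FRONT + 1),
}
```

A SAT answer for any of these boxes, combined with the FORWARD postcondition,
yields a concrete sensor reading on which the policy drives into the obstacle.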

The exact encoding of these properties is based on the physical characteristics
of the robot and the lidar sensors, as explained in the full version of this paper [5]. 


Infinite Loops. Whereas collision avoidance is the natural safety property to 
verify in mapless navigation controllers, checking that progress is eventually 
made towards the target is the natural liveness property. Unfortunately, this 
property is difficult to formulate due to the absence of a complete map. Instead, 
we settle for a weaker property, and focus on verifying that the robot does not 
enter infinite loops (which would prevent it from ever reaching the target). 

Unlike the case of collision avoidance, where a single step of the DRL agent 
could constitute a violation, here we need to reason about multiple consecutive 
invocations of the DRL controller, in order to identify infinite loops. This, again, 
is difficult to encode due to the absence of a global map, and so we focus on 
in-place loops: infinite sequences of steps in which the robot turns LEFT and 
RIGHT, but without ever moving FORWARD, thus maintaining its current location 
ad infinitum. 

Our queries for identifying in-place loops encode that: (i) the robot does 
not reach the target in the first step; (ii) in the following k steps, the robot 
never moves FORWARD, i.e., it only performs turns; and (iii) the robot returns 
to an already-visited configuration, guaranteeing that the same behavior will be 
repeated by our deterministic agents. The various queries differ in the choice of 
k, as well as in the sequence of turns performed by the robot. Specifically, we 
encode queries for identifying the following kinds of loops: 


— ALTERNATING LOOP: a loop where the robot performs an infinite sequence of 
(LEFT, RIGHT, LEFT, RIGHT, LEFT...) moves. A query for identifying this loop 
encodes k = 2 consecutive invocations of the DRL agent, after which the 
robot’s sensors will again report the exact same reading, leading to an infinite 
loop. An example appears in Fig. 3. The encoding uses the “sliding window” 
principle, on which we elaborate later. 

— LEFT CYCLE, RIGHT CYCLE: loops in which the robot performs an infinite 
sequence of (LEFT, LEFT, LEFT,...) or (RIGHT, RIGHT, RIGHT,...) operations 
accordingly. Because the Turtlebot turns at a 30° angle, this loop is encoded 
as a sequence of k = 360°/30° = 12 consecutive invocations of the DRL 
agent’s DNN, all of which produce the same turning action (either LEFT or 
RIGHT). Using the sliding window principle guarantees that the robot returns 
to the same exact configuration after performing this loop, indicating that 
it will never perform any other action. 


We also note that all the loop-identification queries include a condition for 
ensuring that the robot is not blocked from all directions. Consequently, any 
loops that are discovered demonstrate a clearly suboptimal behavior. 
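The notion of an in-place loop can be illustrated by simulating a deterministic
policy on one concrete scene. This is not how the verification queries work
(a query covers all inputs that satisfy its constraints, not a single scene);
the ring-of-twelve model of the surroundings, the sign convention for the
relative angle, and the example policy are all hypothetical.

```python
def lidar_view(ring, heading):
    """Seven readings, left to right, taken from 12 directions 30 degrees apart."""
    return [ring[(heading + k) % 12] for k in range(-3, 4)]


def find_in_place_loop(policy, ring, heading, target_dir):
    """Simulate turns until FORWARD is chosen or a heading repeats.

    Returns the list of actions forming the loop, or None if the robot
    eventually moves FORWARD. Since the robot does not move and the policy is
    deterministic, a repeated heading means a repeated configuration.
    """
    seen = {}    # heading -> index in the action trace
    trace = []
    while heading not in seen:
        seen[heading] = len(trace)
        # Relative angle in degrees; negative means the target is to the left.
        rel_angle = ((target_dir - heading + 6) % 12 - 6) * 30
        action = policy(lidar_view(ring, heading), rel_angle)
        if action == "FORWARD":
            return None
        trace.append(action)
        heading = (heading + (1 if action == "RIGHT" else -1)) % 12
    return trace[seen[heading]:]


def alternating_policy(lidar, rel_angle):
    """Hypothetical faulty policy: never FORWARD, turns away when aligned."""
    return "LEFT" if rel_angle < 0 else "RIGHT"


open_arena = [1.0] * 12   # no obstacle within range in any direction
two_step = find_in_place_loop(alternating_policy, open_arena, 0, 0)
full_cycle = find_in_place_loop(lambda lidar, a: "RIGHT", open_arena, 0, 0)
```

The first call exhibits a loop of length 2 (the ALTERNATING LOOP pattern) and
the second a loop of length 12, matching k = 360°/30° for a full cycle.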


Specific Behavior Profiles. In our experiments, we noticed that the safe poli- 
cies, i.e., the ones that do not cause the robot to collide, displayed a wide spec- 
trum of different behaviors when navigating to the target. These differences 
occurred not only between policies that were trained by different algorithms, 
but also between policies trained by the same reward strategy — indicating that 
these differences are, at least partially, due to the stochastic realization of the
DRL training process. 

Specifically, we noticed high variability in the length of the routes selected
by the DRL policy in order to reach the given target: while some policies
demonstrated short, efficient paths that passed very close to obstacles, other
policies demonstrated a much more conservative behavior, by selecting longer
paths and avoiding getting close to obstacles (an example appears in Fig. 6).

Thus, we used our verification-driven approach to quantify how conservative
the learned DRL agent is in the mapless navigation setting. Intuitively, a
highly conservative policy will keep a significant safety margin from obstacles
(possibly taking a longer route to reach its destination), whereas a “braver”
and less conservative controller would risk venturing closer to obstacles. In
the case of Turtlebot, the preferable DRL policies are the ones that guarantee
the robot’s safety (with respect to collision avoidance), and demonstrate a
high level of bravery, as these policies tend to take shorter, optimized paths
(see path A in Fig. 6), which lead to reduced energy consumption over the
entire trail.

Fig. 6: Comparing paths selected by policies with different bravery levels.
Path A takes the Turtlebot close to the obstacle (red area), and is the
shortest. Path B maintains a greater distance from the obstacle (light red
area), and is consequently longer. Finally, path C maintains such a significant
distance from the obstacle (white area) that it is unable to reach the target.

Bravery assessment is performed by encoding verification queries that identify 
situations in which the Turtlebot can move forward, but its control policy chooses 
not to. Specifically, we encode single invocations of the DRL model, in which we 
bound the lidar inputs to indicate that the Turtlebot is sufficiently distant from 
any obstacle and can safely move forward. We then use the verifier to determine 
whether, in this setting, a FORWARD output is possible. By altering and adjusting 
the bounds on the central lidar sensor, we can control how far away the robot 
perceives the obstacle to be. If we limit this distance to large values and the
policy still does not move FORWARD, it is considered conservative; otherwise, it is
considered brave. By conducting a binary search over these bounds [6], we can 
identify the shortest distance from an obstacle for which the policy safely orders 
the robot to move FORWARD. This value’s inverse then serves as a bravery score 
for that policy. 
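The binary search can be sketched as follows. The callable `forward_possible`
is a hypothetical stand-in for a single verification query with the central
lidar bounded at distance d, and we assume its answer is monotone in d; the
search range and precision are the ones used in our evaluation (Section 5).

```python
def min_forward_distance(forward_possible, low=0.18, high=1.0, precision=0.01):
    """Smallest obstacle distance (up to `precision`) at which FORWARD is chosen.

    `forward_possible(d)` stands in for one verifier call and is assumed to be
    monotone: False below some threshold and True at or above it.
    Returns None if the policy refuses to move FORWARD even at `high`.
    """
    if not forward_possible(high):
        return None
    while high - low > precision:
        mid = (low + high) / 2
        if forward_possible(mid):
            high = mid   # FORWARD still possible: try a closer obstacle
        else:
            low = mid    # too close for this policy: back off
    return high


def bravery_score(distance):
    """Inverse of the minimal forward distance; larger means braver."""
    return 1.0 / distance


# Stub oracle for illustration: this "policy" moves FORWARD from 0.4 m onwards.
d_min = min_forward_distance(lambda d: d >= 0.4)
score = bravery_score(d_min)
```

Each iteration costs one verification query, so a range of 0.82 meters and a
precision of 0.01 require at most seven queries per model after the initial
check.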


Design-for-Verification: Sliding Windows. A significant challenge that we
faced in encoding our verification properties, especially those that pertain to 
multiple consecutive invocations of the DRL policy, had to do with the local 
nature of the sensor readings that serve as input to the DNN. Specifically, if 
the robot is in some initial configuration that leads to a sensor input x, and
then chooses to move forward and reaches a successor configuration in which the 
sensor input is x’, some connection between x and x’ must be expressed as part 
of the verification query (i.e., nearby obstacles that exist in x cannot suddenly 
vanish in x’). In the absence of a global map, this is difficult to enforce. 

In order to circumvent this difficulty, we used the sliding window princi- 
ple, which has proven quite useful in similar settings [6,21]. Intuitively, the 
idea is to focus on scenarios where the connections between x and x’ are par- 
ticularly straightforward to encode — in fact, most of the sensor information 
that appeared in x also appears in x’. This approach allows us to encode
multistep queries, and is also beneficial in terms of performance: typically, adding
sliding-window constraints reduces the search space explored by the verifier, and 
expedites solving the query. 

In the Turtlebot setting, this is achieved by selecting a robot configuration in 
which the angle between two neighboring lidar sensors is identical to the turning 
angle of the robot (in our case, 30°). This guarantees, for example, that if the 
central lidar sensor observes an obstacle at distance d and the robot chooses to 
turn RIGHT, then at the next step, the lidar sensor just to the left of the central 
sensor must detect the same obstacle, at the same distance d. More generally, 
if at time-step t the 7 lidar readings (from left to right) are
$(l_1, \ldots, l_7)$ and the robot turns RIGHT, then at time-step t + 1 the 7
readings are $(l_2, l_3, \ldots, l_7, l_8)$, where only $l_8$ is a new reading.
The case for a LEFT turn is symmetrical. By
placing these constraints on consecutive states encountered by the robot, we 
were able to encode complex properties that involve multiple time-steps, e.g., as 
in the aforementioned infinite loops. An illustration appears in Fig. 3. 
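The sliding-window relation between consecutive lidar vectors can be written
down directly. The helper names are hypothetical; the shift itself follows the
description above, with readings listed from left to right.

```python
def slide(readings, action, new_reading):
    """Next-step lidar vector after an in-place turn of one beam spacing."""
    if action == "RIGHT":
        # (l1, ..., l7) -> (l2, ..., l7, l8): the view shifts by one beam.
        return readings[1:] + [new_reading]
    if action == "LEFT":
        # Symmetrical case: a new reading enters on the left.
        return [new_reading] + readings[:-1]
    raise ValueError("sliding window applies only to LEFT/RIGHT turns")


def consistent(prev, action, nxt):
    """Check the sliding-window constraint between two consecutive readings."""
    if action == "RIGHT":
        return nxt[:-1] == prev[1:]
    if action == "LEFT":
        return nxt[1:] == prev[:-1]
    return False


before = [1, 2, 3, 4, 5, 6, 7]
after_right = slide(before, "RIGHT", 8)
after_left = slide(before, "LEFT", 0)
```

In a verification query, the same relation appears as equality constraints
between six input variables of one invocation and six of the next, leaving a
single lidar input free per step.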


5 Experimental Evaluation 


Next, we ran verification queries with the aforementioned properties, in order to 
assess the quality of our trained DRL policies. The results are reported below. 
In many cases, we discovered configurations in which the policies would cause 
the robot to collide or enter infinite loops; and we later validated the correctness 
of these results using a physical robot. We strongly encourage the reader to 
watch a short video clip that demonstrates some of these results [4]. Our code 
and benchmarks are also available online [3]. In our experiments, we used the
Marabou verification engine [33] as our backend, although other engines could 
be used as well. For additional details regarding the experiments, we refer the 
reader to the full version of this paper [5]. 


Model Selection. In this set of experiments, we used verification to assess
our trained models. Specifically, we used each of the three training algorithms
(DDQN, Reinforce, PPO) to train 260 models, creating a total of 780 models.
For each of these, we verified six properties of interest: three collision
properties (FORWARD COLLISION, LEFT COLLISION, RIGHT COLLISION), and three loop
properties (ALTERNATING LOOP, LEFT CYCLE, RIGHT CYCLE), as described in
Section 4. This gives a total of 4680 verification queries. We ran all queries
with a TIMEOUT value of 12 hours and a MEMOUT limit of 2 GB; the results are
summarized in Table 1. The single-step collision queries usually terminated
within seconds, and the 2-step queries encoding an ALTERNATING LOOP usually
terminated within minutes. The 12-step cycle queries, which are more complex,
usually ran for a few hours. 9.6% of all queries hit the TIMEOUT limit (all
from the 12-step cycle category), and none of the queries hit the MEMOUT limit.²

            LEFT COLLISION    FORWARD COLLISION    RIGHT COLLISION
Algorithm   SAT    UNSAT      SAT    UNSAT         SAT    UNSAT
DDQN        259    1          248    12            258    2
Reinforce   255    5          254    6             252    8
PPO         196    64         197    63            207    53

            ALTERNATING LOOP    LEFT CYCLE      RIGHT CYCLE     INSTABILITY
Algorithm   SAT    UNSAT        SAT    UNSAT    SAT    UNSAT    # alternations
DDQN        260    0            56     77       56     61       21
Reinforce   145    115          5      185      120    97       10
PPO         214    45           26     198      30     198      1

Table 1: Results of the policy verification queries. We verified six properties
over each of the 260 models trained per algorithm; SAT indicates that the
property was violated, whereas UNSAT indicates that it held (to reduce clutter,
we omit TIMEOUT and FAIL results). The rightmost column reports the stability
values of the various training methods. For the full results see [3].

Our results exposed various differences between the trained models. Specif- 
ically, of the 780 models checked, 752 (over 96%) violated at least one of the 
single-step collision properties. These 752 collision-prone models include all 260 
DDQN-trained models, 256 Reinforce models, and 236 PPO models. Further- 
more, when we conducted a model filtering process based on all six properties 
(three collisions and three infinite loops), we discovered that 778 models out 
of the total of 780 (over 99.7%!) violated at least one property. The only two 
models that passed our filtering process were trained by the PPO algorithm. 

Further analyzing the results, we observed that PPO models tended to be 
safer to use than those trained by other algorithms: they usually had the fewest 
violations per property. However, there are cases in which PPO proved less suc- 
cessful. For example, our results indicate that PPO-trained models are more 
prone to enter an ALTERNATING LOOP than those trained by Reinforce. Specif- 
ically, 214 (82.3%) of the PPO models have entered this undesired state, com- 
pared to 145 (55.8%) of the Reinforce models. We also point out that, similarly 
to the case with collision properties, all DDQN models violated this property. 

² We note that two queries failed due to internal errors in Marabou.

Finally, when considering 12-step cycles (either LEFT CYCLE or RIGHT CYCLE),
44.8% of the DDQN models entered such cycles, compared to 30.7% of the
Reinforce models, and just 12.4% of the PPO models. To obtain these figures, we
computed the fraction of violations (SAT queries) out of the number of queries
that did not time out or fail, and aggregated SAT results for both cycle directions.
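The percentages above can be recomputed from the LEFT CYCLE and RIGHT CYCLE
columns of Table 1. The snippet below is only a worked check of that
arithmetic; the dictionary and function names are ours.

```python
# (SAT, UNSAT) counts for LEFT CYCLE and RIGHT CYCLE, copied from Table 1.
CYCLE_RESULTS = {
    "DDQN": [(56, 77), (56, 61)],
    "Reinforce": [(5, 185), (120, 97)],
    "PPO": [(26, 198), (30, 198)],
}


def cycle_violation_rate(results):
    """Percentage of SAT answers among queries that returned SAT or UNSAT."""
    sat = sum(s for s, _ in results)
    decided = sum(s + u for s, u in results)
    return 100.0 * sat / decided


rates = {name: round(cycle_violation_rate(r), 1)
         for name, r in CYCLE_RESULTS.items()}
```

Note that the denominators differ per algorithm (250, 407, and 452 decided
queries out of 520), because timed-out and failed queries are excluded.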

Interestingly, in some cases, we observed a bias toward violating a certain 
subcase of various properties. For example, in the case of entering full cycles — 
although 125 (out of 520) queries indicated that Reinforce-trained agents may 
enter a cycle in either direction, in 96% of these violations, the agent entered a 
RIGHT CYCLE. This bias is not present in models trained by the other algorithms, 
where the violations are roughly evenly divided between cycles in both directions. 

We find that our results demonstrate that different “black-box” algorithms 
generalize very differently with respect to various properties. In our setting, PPO 
produces the safest models, while DDQN tends to produce models with a higher 
number of violations. We note that this does not necessarily indicate that PPO- 
trained models perform better, but rather that they are more robust to corner 
cases. Using our filtering mechanism, it is possible to select the safest models 
among the available, seemingly equivalent candidates. 

Next, we used verification to compute the bravery score of the various models. 
Using a binary search, we computed for each model the minimal distance a dead- 
ahead obstacle needs to have for the robot to safely move forward. The search 
range was [0.18, 1] meters, and the optimal values were computed up to a 0.01 
precision (see the full version of this paper [5] for additional details). Almost all 
binary searches terminated within minutes, and none hit the TIMEOUT threshold. 

By first filtering the models based on their safe behavior, and then by their 
bravery scores, we are able to find the few models that are both safe (do not col- 
lide), and not overly conservative. These models tend to take efficient paths, and 
may come close to an obstacle, but without colliding with it. We also point out 
that over-conservativeness may significantly reduce the success rate in specific 
scenarios, such as cases in which the obstacle is close to the target. Specifically, 
of the only two models that survived the first filtering stage, one is considerably 
more conservative than the other — requiring the obstacle to be twice as distant 
as the other, braver, model requires it to be, before moving forward. 


Algorithm Stability Analysis. As part of our experiments, we used our 
method to assess the three training algorithms — DDQN, PPO, and Reinforce. 
Recall that we used each algorithm to train 52 families of 5 models each, in which 
the models from the same family are generated from the same random seed, but 
with a different number of training iterations. While all models obtained a high 
success rate, we wanted to check how often it occurred that a model success- 
fully learned to satisfy a desirable property after some training iterations, only 
to forget it after additional iterations. Specifically, we focused on the 12-step 
full-cycle properties (LEFT CYCLE and RIGHT CYCLE), and for each family of 5 
models checked whether some models satisfied the property while others did not. 

We define a family of models to be unstable in the case where a property holds
for some model in the family, but ceases to hold for another model from the
same family with a higher number of training iterations. Intuitively, this
means that the model
“forgot” a desirable property as training progressed. The instability value of 
each algorithm type is defined to be the number of unstable 5-member families. 
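The definition above can be sketched as a simple check over a family's
verification results. The function names are hypothetical, and the sketch
assumes that every query in the family returned SAT or UNSAT; how timed-out
queries are treated is not covered here.

```python
def is_unstable(holds_by_checkpoint):
    """True if the property holds at some checkpoint and fails at a later one.

    `holds_by_checkpoint` lists, in order of increasing training iterations,
    whether the property held (True, i.e., UNSAT) for each model in the family.
    """
    seen_holding = False
    for holds in holds_by_checkpoint:
        if seen_holding and not holds:
            return True
        seen_holding = seen_holding or holds
    return False


def instability_value(families):
    """Number of unstable families among those given."""
    return sum(is_unstable(family) for family in families)


# Illustrative data only, not taken from our experiments.
example_families = [
    [False, True, True, False, False],   # learned the property, then forgot it
    [False, False, True, True, True],    # learned it and kept it
    [True, True, True, True, True],      # always satisfied
]
unstable_count = instability_value(example_families)
```

A family that never satisfies the property is not counted as unstable under
this definition, even though all of its members violate the property.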




Although all three algorithms produced highly accurate models, they dis- 
played significant differences in the stability of their produced policies, as can 
be seen in the rightmost column of Table 1. Recall that we trained 52 families 
of models using each algorithm, and then tested their stability with respect to 
two properties (corresponding to the two full cycle types). Of these, the DDQN 
models display 21 unstable alternations, more than twice the number of
alternations demonstrated by Reinforce models (10), and significantly higher than
the number of alternations observed among the PPO models (1).

These results shed light on the nature of these training algorithms — indi- 
cating that DDQN is a significantly less stable training algorithm, compared to 
PPO and Reinforce. This is in line with previous observations in non-verification- 
related research [50], and is not surprising, as the primary objective of PPO is to 
limit the changes the optimizer performs between consecutive training iterations. 


Gradient-Based Methods. We also conducted a thorough comparison be- 
tween our verification-based approach and competing gradient-based methods. 
Although gradient-based attacks are extremely scalable, our results (summarized 
in [5]) show that they may miss many of the violations found by our complete, 
verification-based procedure. For example, when searching for collisions, our ap- 
proach discovered a total of 2126 SAT results, while the gradient-based method 
discovered only 1421 SAT results — a 33% decrease (!). In addition, given that 
gradient-based methods are unable to return UNSAT, they are also incapable 
of proving that a property always holds, and hence cannot formally guarantee 
the safety of a policy in question. Thus, performing model selection based on 
gradient-based methods could lead to skewed results. We refer the reader to the 
full version of this paper [5], in which we elaborate on gradient attacks and the 
experiments we ran, demonstrating the advantages of our approach for model 
selection, when compared to gradient-based methods. 
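The collision figures quoted above follow from the SAT columns of Table 1. The
snippet below is only a worked check of that arithmetic; the variable names are
ours, and the gradient-based count is the one stated in the text.

```python
# SAT counts for LEFT, FORWARD, and RIGHT COLLISION, copied from Table 1.
COLLISION_SAT = {
    "DDQN": (259, 248, 258),
    "Reinforce": (255, 254, 252),
    "PPO": (196, 197, 207),
}

verification_sat = sum(sum(counts) for counts in COLLISION_SAT.values())
gradient_sat = 1421   # violations found by the gradient-based method

# Relative decrease in discovered violations, in percent.
decrease = 100.0 * (verification_sat - gradient_sat) / verification_sat
```

In absolute terms, the gradient-based method missed 705 of the 2126 violating
queries.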


6 Related Work 


Due to the increasing popularity of DNNs, the formal methods community has 
put forward a plethora of tools and approaches for verifying DNN correctness 
[20,24,26,28,31-33,36,39,52,59]. Recently, the verification of systems involving
multiple DNN invocations, as well as hybrid systems with DNN components, 
has been receiving significant attention [6,9,17,18,22,34,54,61]. Our work here 
is another step toward applying DNN verification techniques to additional, real- 
world systems and properties of interest. 

In the robotics domain, multiple approaches exist for increasing the reliability 
of learning-based systems [48,62,69]; however, these methods are mostly heuristic 
in nature [1,23,42]. To date, existing techniques rely mostly on Lagrangian mul- 
tipliers [38,49,53], and do not provide formal safety guarantees; rather, they op- 
timize the training in an attempt to learn the required policies [12]. Other, more 
formal approaches focus solely on the systems’ input-output relations [15,41], 
without considering multiple invocations of the agent and its interactions with 
the environment. Thus, existing methods are not able to provide rigorous guar-
antees regarding the correctness of multistep robotic systems, and do not take 
into account sequential decision making — which renders them insufficient for 
detecting various safety and liveness violations. 

Our approach is orthogonal and complementary to many existing safe DRL 
techniques. Reward reshaping and shielding techniques (e.g., [2]) improve safety 
by altering the training loop, but typically afford no formal guarantees. Our 
approach can be used to complement them, by selecting the most suitable policy 
from a pool of candidates, post-training. Guard rules and runtime shields are 
beneficial for preventing undesirable behavior of a DNN agent, but are sometimes 
less suited for specifying the desired actions it should take instead. In contrast, 
our approach allows selecting the optimal policy from a pool of candidates, 
without altering its decision-making. 


7 Conclusion 


Through the case study described in this paper, we demonstrate that current 
verification technology is applicable to real-world systems. We show this by ap- 
plying verification techniques for improving the navigation of DRL-based robotic 
systems. We demonstrate how off-the-shelf verification engines can be used to 
conduct effective model selection, as well as gain insights into the stability of 
state-of-the-art training algorithms. As far as we are aware, ours is the first work 
to demonstrate the use of formal verification techniques on multistep properties 
of actual, real-world robotic navigation platforms. We also believe the techniques 
developed here will allow the use of verification to improve additional multistep 
systems (autonomous vehicles, surgery-aiding robots, etc.), in which we can im- 
pose a transition function between subsequent steps. However, our approach is 
limited by DNN-verification technology, which we use as a black-box backend. As 
that technology becomes more scalable, so will our approach. Moving forward, 
we plan to generalize our work to richer environments — such as cases where 
a memory-enhanced agent interacts with moving objects, or even with multiple 
agents in the same arena, as well as running additional experiments with deeper 
networks, and more complex DRL systems. In addition, we see probabilistic 
verification of stochastic policies as interesting future work. 


Abstract. We present a new flow framework for separation logic reasoning about 
programs that manipulate general graphs. The framework overcomes problems in 
earlier developments: it is based on standard fixed point theory, guarantees least 
flows, rules out vanishing flows, and has an easy to understand notion of footprint 
as needed for soundness of the frame rule. In addition, we present algorithms for 
automating the frame rule, which we evaluate on graph updates extracted from 
linearizability proofs for concurrent data structures. The evaluation demonstrates 
that our algorithms help to automate key aspects of these proofs that have previously 
relied on user guidance or heuristics. 


1 Introduction 


The flow framework [23, 24] is an abstraction mechanism based on separation logic [5, 
32, 40] that enables reasoning about global inductive invariants of general graphs in 
a local manner. The framework has proved useful to verify intricate algorithms that 
are difficult to handle by other techniques, such as the Priority Inheritance Protocol, 
object-oriented design patterns, and complex concurrent data structures [22,24, 27,34]. 
However, these efforts have also exposed some rough corners in the underlying meta 
theory that either limit expressivity or automation. In this paper, we propose a new meta 
theory for the flow framework that aims to strike a balance between these conflicting 
requirements. In addition, we present algorithms that aid proof automation. 


Background. The central notion of the flow framework is that of a flow. Given a 
commutative monoid $(M, +, 0)$ (e.g., natural numbers with addition), and a graph with 
nodes $X$ and an edge function $E : X \times X \to M \to M$, a flow is a function $fl : X \to M$ 
that satisfies the flow equation: 


\[ \forall x \in X.\quad fl(x) = in_x + \sum_{y \in X} E_{(y,x)}(fl(y)). \]


That is, fl is a fixed point of the function that assigns every node x an initial value 
$in_x \in M$, its inflow, and then propagates these values through the graph according 
to the edge function. This is akin to a forward data flow analysis where the monoid 
operation + is used as the join. By choosing an appropriate flow monoid, inflow, and 
edge function, one can express inductive properties of graphs (reachability, sortedness, 
etc.) in terms of conditions that refer only to each node's flow value $fl(x)$. 
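The flow equation can be made concrete with a small sketch. The following Python fragment is an illustration only, not part of the framework's tooling: it assumes the monoid of natural numbers with addition, a hypothetical three-node acyclic graph, and a helper named `least_flow` that is introduced here. It computes a solution of the flow equation by iterating from the all-zero assignment; on acyclic graphs this iteration stabilises, whereas on a cycle with positive inflow it would not (hence the round limit).

```python
from typing import Callable, Dict, List, Tuple

Node = str
EdgeFn = Callable[[int], int]


def least_flow(
    nodes: List[Node],
    edges: Dict[Tuple[Node, Node], EdgeFn],
    inflow: Dict[Node, int],
    max_rounds: int = 1000,
) -> Dict[Node, int]:
    """Iterate fl(x) = in_x + sum_y E_(y,x)(fl(y)), starting from fl = 0."""
    flow = {x: 0 for x in nodes}
    for _ in range(max_rounds):
        nxt = {
            x: inflow.get(x, 0)
            + sum(f(flow[y]) for (y, target), f in edges.items() if target == x)
            for x in nodes
        }
        if nxt == flow:
            return flow
        flow = nxt
    raise RuntimeError("no fixed point reached within max_rounds")


# Hypothetical example graph: r -> u -> v and r -> v, all edges labeled
# with the identity function, and inflow 1 at r only.
identity = lambda m: m
nodes = ["r", "u", "v"]
edges = {("r", "u"): identity, ("u", "v"): identity, ("r", "v"): identity}
inflow = {"r": 1}

flow = least_flow(nodes, edges, inflow)
print(flow)  # {'r': 1, 'u': 1, 'v': 2}
```

With identity edges and inflow 1 at r, the flow of each node is the number of paths from r to it, which is why v receives 2.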

A graph endowed with an inflow and associated flow is a flow graph. An example 
flow graph h is shown on the right-hand side of Fig. 1a. Here, the flow value fl(w) for 
a node w counts the number of paths from r to w. A flow graph can be partial and have 
edges to nodes outside of X like the node u for $h_1$ in Fig. 1a. If we include these nodes 
in the computation of the flow, then their flow values constitute the outflow of the flow 
graph. For instance, the outflow of $h_1$ for u is 1. 

Figure 1. (a) Two flow graphs $h_1$ with nodes $h_1.X = \{x, y, z\}$ (left) and $h_2$ with nodes 
$h_2.X = \{r, u, v\}$ (center) for the flow monoid of natural numbers with addition. The 
edge label $\lambda_{id}$ stands for the identity function. Omitted edges are labeled by the 
constant 0 function. Dashed edges represent the inflows. Nodes are labeled by their flow, 
respectively, outflow. The right side shows the composition $h = h_1 * h_2$. (b) Two flow 
graphs $h_1$ with $h_1.X$ = {u,a2} (top) and $h_2$ with $h_2.X$ = {v,w} (bottom) whose 
composition is undefined due to vanishing flows. 

Flow graphs are equipped with a notion of disjoint composition, $h = h_1 * h_2$. An 
example is given in Fig. 1a. The composition is only defined if the union of the flows 
of $h_1$ and $h_2$ is again a flow of h. This may not always be the case. For instance, the 
inflows and outflows of $h_1$ and $h_2$ may be mutually incompatible such as $h_1$ sending 
outflow 2 to u whereas the inflow to u in $h_2$ is only 1. 

Flow graph composition yields a separation algebra. That is, if we use flow graphs 
as an abstraction of program states (e.g., the heap), then we can use separation logic to 
reason locally about properties of programs that are expressed in terms of the induced 
flow graphs. For example, suppose the program updates the flow graph h in Fig. 1a 
to a new flow graph h' by inserting a new edge labeled $\lambda_{id}$ between the nodes r and 
u. This increases the flow of u and v from 1 to 2. We can break this update down as 
follows. First, we decompose h into $h_1$ and $h_2$. Next, we obtain $h_2'$ from $h_2$ by inserting 
the edge and updating the flow of u and v to 2. Finally, we compose $h_2'$ again with 
$h_1$ to obtain h'. Note that the composition $h_1 * h_2'$ is still defined. This means that any 
property expressed over the flow in the $h_1$-portion of h still holds in h'. This is the 
well-known frame rule of separation logic, instantiated for flow graphs. 

The crux in applying the frame rule is to show that the composition $h_1 * h_2'$ is indeed 
defined. One can do this locally by showing that the update $h_2 \leadsto h_2'$ is 
frame-preserving, i.e., for any $h_1$ such that $h_1 * h_2$ is defined, $h_1 * h_2'$ is also defined. 

Typically, the flow subgraphs involved in a frame-preserving update $h_2 \leadsto h_2'$ include 
more nodes than those immediately affected by the update. For instance, consider 
the subgraphs of h and h' in our example that consist only of the nodes {r, u} directly 
affected by inserting the edge. These subgraphs do not constitute a frame-preserving 
update because inserting the edge between r and u also changes the outflow to v from 
1 to 2. Hence, the updated subgraph for {r, u} would no longer compose with the rest 
of h where v’s flow is still 1 instead of 2. We refer to a set of nodes such as {r, u, v} 
that identifies a frame-preserving update as the update’s footprint. 


Meta theories of flow graphs. In addition to ensuring that flow graph composition 
yields a separation algebra, there are two desiderata that one has to take into considera- 
tion when designing a meta theory of flow graphs: 
— Obtaining unique flows. When encoding inductive properties using flows, one is 
often interested in a particular flow, most commonly the least fixed point of the 
flow equation for a given inflow. One therefore needs a way to focus the reasoning 
on the particular flow of interest. 
— Identifying frame-preserving updates. In order to enable the application of the 
frame rule, one needs a way to effectively compute candidate footprints and check 
whether they identify frame-preserving updates. 
The first subgoal is crucial for expressivity and the second one for proof automation. 
Achieving one subgoal makes it more difficult to achieve the other. Specifically, consider 
the meta theory proposed in [24]. It requires that the flow monoid $(M, +, 0)$ is also 
cancellative ($m + n_1 = o$ and $m + n_2 = o$ implies $n_1 = n_2$). Requiring cancellativity has 
the advantage that it is easy to check if an update $h \leadsto h'$ is frame-preserving: it suffices 
to show that h and h' have the same inflow and outflow. Cancellativity also ensures that 
for each flow fl, there exists a unique inflow that produces fl. Hence, it is sufficient to 
track only fl since the inflow is a derived quantity. However, the converse does not hold. 

In fact, obtaining unique flows for cancellative M becomes more difficult. A natural 
requirement that one would like to impose on M is that the pre-order induced by + 
forms a complete partial order (cpo) or even a complete lattice. This way, one can focus 
on the least flow, which is guaranteed to exist if one applies standard fixed point theo- 
rems, imposing only mild assumptions on the edge functions. However, cancellativity 
is inherently incompatible with standard domain-theoretic prerequisites. For instance, 
the only ordered cancellative commutative monoid that is a directed cpo is the trivial 
one: $M_0 = \{0\}$. Similarly, $M_0$ is the only such monoid that has a greatest element. 

For cases where unique flows are desired, [24] imposes additional requirements on 
the edge functions (nil-potent) or the graph structure (effectively acyclic). The former is 
quite restrictive in terms of expressivity. The latter again complicates the computation 
of frame-preserving updates: one now has to ensure that no cycles are introduced when 
the updated graph $h_2'$ is composed with its frame $h_1$. In fact, for the effectively acyclic 
case, [24] only provides a sufficient condition that a given footprint yields a frame- 
preserving update but it gives no algorithm for computing such a footprint. 


Contributions. In this paper, we propose a new meta theory of flows based on flow 
monoids that form $\omega$-cpos (but need not be cancellative). The cpo requirement yields 
the desired least fixed point semantics. The differences in the requirements on the flow 
monoid necessitate a new notion of flow graph composition. In particular, for a least 
fixed point semantics of flows, $h = h_1 * h_2$ is only defined if the flows of $h_1$ and $h_2$ do 
not vanish. An example of such a situation is shown in Fig. 1b, where the flows in $h_1$ 
and $h_2$ would vanish to 0 in $h_1 * h_2$ because the created cycle has no external inflow. 
Moreover, an update $h \leadsto h'$ is frame-preserving if h and h' route inflows to outflows 
in the same way. We formalize this condition using a notion of contextual equivalence 
of the graphs' transfer functions, which are the least fixed points of the flow equation, 
parameterized by the inflows and restricted to the nodes outside the graphs. We then 
identify conditions on the edge functions that are commonly satisfied in practice and 
that allow us to effectively check contextual equivalence of transfer functions. This re- 
sult is remarkable because the flow monoid can have infinite ascending chains and the 
flow graphs can be cyclic. Building on this equivalence check, we propose an iterative 
algorithm for computing footprints of updates. This algorithm enables the automation 
of the frame rule for reasoning about programs manipulating flow graphs. We evalu- 
ate the presented algorithms on a benchmark suite of flow graph updates that are ex- 
tracted from linearizability proofs for concurrent search structures constructed by the 
tool plankton [26,27]. The evaluation demonstrates that our algorithms help to automate 
key aspects of these proofs that have previously relied on user guidance or heuristics. 


2 Flow Graph Separation Algebra 


We start with the presentation of our new separation algebra of flow graphs. 
Given a commutative monoid $(M, +, 0)$, we define the binary relation $\le$ on $M$ by 
$n \le m$ if there is $o \in M$ with $m = n + o$. Flow values are drawn from a flow monoid, a 
commutative monoid for which the relation $\le$ is an $\omega$-cpo. That is, $\le$ is a partial order 
and every ascending chain $K = m_0 \le m_1 \le \ldots$ in $M$ has a least upper bound, denoted 
$\bigsqcup K$. We expect $n + \bigsqcup K = \bigsqcup (n + K)$. In the following, we fix a flow monoid $(M, +, 0)$. 
Let $ContFun(M \to M)$ be the continuous functions in $M \to M$. Recall that a 
function $f : M \to M$ is continuous [43] if it commutes with limits of ascending chains, 
$f(\bigsqcup K) = \bigsqcup f(K)$ for every chain $K$ in $M$. We lift $+$ and $\le$ to functions $M \to M$ in 
the expected way. An empty iterated sum $\sum_{i \in \emptyset} m_i$ is defined to be 0. 

Lemma 1. $(ContFun(M \to M), \circ, id)$ is a monoid. Moreover, if $(M, \le)$ is an $\omega$-cpo, 
so is $(ContFun(M \to M), \le)$. 


A flow graph is a tuple $h = (X, E, in)$ consisting of a finite set of nodes $X \subseteq \mathbb{N}$, a 
set of edges $E : X \times \mathbb{N} \to ContFun(M \to M)$ labeled by continuous functions, and 
an inflow $in : (\mathbb{N} \setminus X) \times X \to M$. We use FG for the set of all flow graphs and denote 
the empty flow graph by $h_\emptyset \triangleq (\emptyset, \emptyset, \emptyset)$. 

We define two derived functions for flow graphs. First, the flow is the least function 
$flow : X \to M$ satisfying the flow equation: $flow(x) = in_x + rhs_x(flow)$, for all 
$x \in X$. Here, $in_x \triangleq \sum_{y \in \mathbb{N} \setminus X} in(y, x)$ is a monoid value and $rhs_x \triangleq \sum_{y \in X} E_{(y,x)}$ 
is a function of type $ContFun((X \to M) \to M)$. Finally, we also define the outflow 
$out : X \times (\mathbb{N} \setminus X) \to M$ by $out(x, y) = E_{(x,y)}(flow(x))$. 


Example 1. For linearizability proofs of concurrent search structures one can use a flow 
that labels every data structure node x with its inset, the set of keys k' such that a thread 
searching for k' may traverse the node x [22,23]. Translated to our setting, the relevant 
flow monoid is the powerset of keys, $\mathcal{P}(\mathbb{Z} \cup \{-\infty, \infty\})$, with set union as addition. 
Figure 2 shows two keyset flow graphs that abstract potential states of a concurrent set 
implementation based on sorted linked lists. When a key k is removed from the set, 
the node x that stores k is first marked to indicate that x has been logically deleted. In 
a second step, x is then physically unlinked from the list. The idea of the abstraction 
is that an edge leaving a node x that stores a key k is labeled by the function $\lambda_k$ if x 
is unmarked and otherwise by $\lambda_{-\infty}$. This is because a search for $k' \in \mathbb{Z}$ will traverse 
the edge leaving x iff $k < k'$ or x is marked. In the figure, l and r are assumed to be 
unmarked, storing keys 6 and 8, respectively. Node t is assumed to be marked. Flow 
graph $h_2$ is obtained from $h_1$ by physically unlinking the marked node t. Using the 
keyset flow one can then express the crucial data structure invariants that are needed 
for a linearizability proof based on local reasoning (e.g., the invariant that the logical 
contents of a node is always a subset of its inset). 

Figure 2. Two flow graphs $h_1$ (left) and $h_2$ (right) with $h_1.X = h_2.X = \{l, t, r\}$ for 
the keyset flow monoid $\mathcal{P}(\mathbb{Z} \cup \{-\infty, \infty\})$. The edge label $\lambda_k$ for a key k denotes the 
function $\lambda m.\ (m \setminus [-\infty, k])$. 

We note that the inflow of the global flow graph that abstracts the program state can 
be used in the specification. In the example, one lets $in_r = \mathbb{Z}$ for the root r of the data 
structure and $in_x = \emptyset$ for all other nodes x to indicate that all searches start at r. 
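The keyset flow of Example 1 can be replayed on a small scale. The sketch below is an illustration under stated assumptions: the infinite key domain is replaced by the finite stand-in {4, ..., 11} for the interval (3, ∞), the edge layout is one reading of Fig. 2 (in the second graph, l points directly to r while the unlinked node t still points to r), and the names `lam` and `least_flow` are introduced here for illustration only.

```python
NEG_INF = float("-inf")


def lam(k):
    """Edge label lambda_k: remove every key <= k from the incoming key set."""
    return lambda m: frozenset(key for key in m if key > k)


def least_flow(nodes, edges, inflow):
    """Least solution of the flow equation over finite key sets (union as +)."""
    flow = {x: frozenset() for x in nodes}
    while True:
        nxt = {}
        for x in nodes:
            acc = inflow.get(x, frozenset())
            for (y, target), f in edges.items():
                if target == x:
                    acc = acc | f(flow[y])
            nxt[x] = acc
        if nxt == flow:
            return flow
        flow = nxt


nodes = ["l", "t", "r"]
# Assumption: keys 4..11 stand in for the interval (3, infinity).
inflow = {"l": frozenset(range(4, 12))}

# l stores 6 (unmarked), t is marked, r stores 8 (unmarked).
h1_edges = {("l", "t"): lam(6), ("t", "r"): lam(NEG_INF), ("r", "out"): lam(8)}
# After unlinking t: l points to r; t keeps its pointer but receives no flow.
h2_edges = {("l", "r"): lam(6), ("t", "r"): lam(NEG_INF), ("r", "out"): lam(8)}

flow1 = least_flow(nodes, h1_edges, inflow)
flow2 = least_flow(nodes, h2_edges, inflow)
out1 = h1_edges[("r", "out")](flow1["r"])
out2 = h2_edges[("r", "out")](flow2["r"])

print(sorted(flow1["t"]), sorted(flow2["t"]))  # [7, 8, 9, 10, 11] []
print(sorted(out1), sorted(out2))              # [9, 10, 11] [9, 10, 11]
```

In this reduced model the inset of t becomes empty once t is unlinked, while the insets of l and r and the outflow leaving r are unchanged.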


Composition without vanishing flows. To define the composition of flow graphs, 
$h_1 * h_2$, we proceed in two steps. We first define an auxiliary composition that may suffer 
from vanishing flows, local flows that disappear in the composition. That is, this 
composition is defined for the flow graphs shown in Fig. 1b. In the composed graph the 
flow of each node is 0 where it was 1 before the composition: the flow vanishes. This 
means that the auxiliary composition does not allow lifting lower bounds on the flow values 
from the individual components to the composed graph. Hence, the actual composition 
restricts the auxiliary composition to rule out such vanishing flows. Definedness 
of the auxiliary composition requires disjointness of the nodes in $h_1$ and $h_2$. Moreover, 
the outflow of one flow graph has to match the inflow expectations of the other: 


\[ h_1 \mathrel{\#\#} h_2 \quad\text{if}\quad X_1 \cap X_2 = \emptyset \;\wedge\; \forall x \in X_1, y \in X_2.\ out_1(x, y) = in_2(x, y) \wedge out_2(y, x) = in_1(y, x). \]

The auxiliary composition $h_1 \uplus h_2$ removes the inflow provided by the other component: 

\[ h_1 \uplus h_2 \triangleq \big(X_1 \uplus X_2,\ E_1 \uplus E_2,\ (in_1 \uplus in_2)|_{(\mathbb{N} \setminus (X_1 \uplus X_2)) \times (X_1 \uplus X_2)}\big). \]

To rule out vanishing flows, we incorporate a suitable equality on the flows: 

\[ h_1 \mathrel{\#} h_2 \quad\text{if}\quad h_1 \mathrel{\#\#} h_2 \;\wedge\; h_1.flow \uplus h_2.flow = (h_1 \uplus h_2).flow. \]


Only if the latter equality holds, do we have the composition $h_1 * h_2 \triangleq h_1 \uplus h_2$. It is 
worth noting that $h_1.flow \uplus h_2.flow \ge (h_1 \uplus h_2).flow$ always holds. What definedness 
really asks for is the reverse inequality. 
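The two-step definition can be exercised on tiny graphs. The following Python sketch is an illustration, not the paper's implementation: it assumes natural numbers with addition, computes flows by bounded iteration from zero, and uses hypothetical names (`FlowGraph`, `matches`, `aux_compose`, `compose`). The first example is a two-node cycle in the spirit of Fig. 1b, the second a simple chain with external inflow.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Optional, Tuple

Node = str
Edge = Tuple[Node, Node]


@dataclass
class FlowGraph:
    nodes: FrozenSet[Node]
    edges: Dict[Edge, Callable[[int], int]]  # source inside `nodes`
    inflow: Dict[Edge, int]                  # source outside `nodes`

    def flow(self, max_rounds: int = 1000) -> Dict[Node, int]:
        cur = {x: 0 for x in self.nodes}
        for _ in range(max_rounds):
            nxt = {}
            for x in self.nodes:
                total = sum(m for (_src, t), m in self.inflow.items() if t == x)
                total += sum(f(cur[y]) for (y, t), f in self.edges.items() if t == x)
                nxt[x] = total
            if nxt == cur:
                return cur
            cur = nxt
        raise RuntimeError("no fixed point within max_rounds")

    def outflow(self) -> Dict[Edge, int]:
        fl = self.flow()
        return {(x, y): f(fl[x]) for (x, y), f in self.edges.items()
                if y not in self.nodes}


def matches(h1: FlowGraph, h2: FlowGraph) -> bool:
    """h1 ## h2: disjoint nodes, and outflows meet the inflow expectations."""
    if h1.nodes & h2.nodes:
        return False
    out1, out2 = h1.outflow(), h2.outflow()
    return all(
        out1.get((x, y), 0) == h2.inflow.get((x, y), 0)
        and out2.get((y, x), 0) == h1.inflow.get((y, x), 0)
        for x in h1.nodes for y in h2.nodes
    )


def aux_compose(h1: FlowGraph, h2: FlowGraph) -> FlowGraph:
    """Auxiliary composition: drop inflow that the other component provides."""
    nodes = h1.nodes | h2.nodes
    inflow = {e: m for e, m in {**h1.inflow, **h2.inflow}.items()
              if e[0] not in nodes}
    return FlowGraph(nodes, {**h1.edges, **h2.edges}, inflow)


def compose(h1: FlowGraph, h2: FlowGraph) -> Optional[FlowGraph]:
    """h1 * h2, or None if undefined (mismatch or vanishing flow)."""
    if not matches(h1, h2):
        return None
    h = aux_compose(h1, h2)
    if {**h1.flow(), **h2.flow()} != h.flow():
        return None  # local flows vanish in the composite
    return h


ident = lambda m: m
# Cycle u <-> v with no external inflow: each part expects 1 from the other.
hu = FlowGraph(frozenset({"u"}), {("u", "v"): ident}, {("v", "u"): 1})
hv = FlowGraph(frozenset({"v"}), {("v", "u"): ident}, {("u", "v"): 1})
cyclic = compose(hu, hv)

# Chain ext -> r -> u: the flow is backed by external inflow.
hr = FlowGraph(frozenset({"r"}), {("r", "u"): ident}, {("ext", "r"): 1})
hu_from_r = FlowGraph(frozenset({"u"}), {}, {("r", "u"): 1})
chain = compose(hr, hu_from_r)

print(cyclic is None, chain.flow() == {"r": 1, "u": 1})  # True True
```

In the cyclic case the auxiliary composition exists but its least flow is 0 at both nodes, so the flow equality fails and `compose` reports the composition as undefined.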

Recall from [5] that a separation algebra is a partial commutative monoid $(X, *, emp)$ 
with a set of units $emp \subseteq X$. 


Lemma 2. $(FG, *, \{h_\emptyset\})$ is a separation algebra. 


3 Frame-Preserving Updates 


Since flow graphs form a separation algebra, we can use separation logic assertions 
to describe sets of flow graphs as in [24] and then use them to prove separation logic 
Hoare triples. A key proof rule used in such proofs is the frame rule. Given separation 
logic assertions $P_1$ and $P_2$, and a command c, the frame rule states: if the Hoare triple 
$\{P_1\}\, c\, \{P_2\}$ is valid, then so is $\{P_1 * F\}\, c\, \{P_2 * F\}$ for any frame F. The remainder of 
the paper focuses on developing algorithms for automating this proof rule. 

The flow graphs described by an assertion may have unbounded size (e.g., due to 
the use of iterated separating conjunctions). We only consider bounded flow graphs in 
the following; the unbounded case is known to be a challenge for which orthogonal 
techniques are being developed (cf. Sect. 6). However, even if the flow graphs have 
bounded size, there may still be infinitely many of them because the inflows and edge 
functions are encoded symbolically in a logical theory of the flow monoid. For peda- 
gogy, we present our algorithms in terms of concrete flow graphs rather than symbolic 
ones. However, our development readily extends to symbolic representations assuming 
the underlying flow monoid theory is decidable. In fact, our implementation discussed 
in Sect. 5 works with symbolic flow graphs. 

The soundness of the frame rule relies on the assumption that the state update induced 
by the command c satisfies a certain locality condition. In our setting, this condition 
amounts to checking that the update of $P_1$ under c is frame-preserving with respect 
to flow graph composition. For the flow graphs $h_1$ described by $P_1$ and all flow graphs 
$h_2$ in the post image of $h_1$ under c, this means that $h_1 \mathrel{\#} h$ implies $h_2 \mathrel{\#} h$ for all h. 
Intuitively, $h_2 \mathrel{\#} h$ still holds if $h_1$ and $h_2$ transfer inflows to outflows in the same way. 

Formally, for a flow graph h we define its transfer function $tf(h)$ mapping inflows 
to outflows, $tf(h) : ((\mathbb{N} \setminus X) \times X \to M) \to X \times (\mathbb{N} \setminus X) \to M$, by 


\[ tf(h)(in') \triangleq h[in \mapsto in'].out. \]


For a given inflow in, we also write $tf(h_1) =_{in} tf(h_2)$ to mean that for all inflows 
$in' \le in$, $tf(h_1)(in') = tf(h_2)(in')$. 


Definition 1. Flow graphs $h_1$, $h_2$ are contextually equivalent, denoted $h_1 =_{ctx} h_2$, if 
we have $h_1.X = h_2.X$, $h_1.in = h_2.in$, and $tf(h_1) =_{h_1.in} tf(h_2)$. 


Theorem 1 (Frame Preservation). For all flow graphs $h_1 =_{ctx} h_2$ and h, $h_1 \mathrel{\#} h$ if 
and only if $h_2 \mathrel{\#} h$ and, in case of definedness, $h_1 * h =_{ctx} h_2 * h$. 
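For a finite monoid, the equality of transfer functions on all smaller inflows can be checked by brute force. The sketch below is only an illustration of Definition 1 under explicit assumptions: keys are restricted to the finite stand-in {4, ..., 11}, inflow is given per node (there is a single external source into l, so this loses nothing here), the graphs follow one reading of Fig. 2, and all function names are hypothetical. The framework's actual equivalence check does not enumerate inflows.

```python
from itertools import combinations

NEG_INF = float("-inf")


def lam(k):
    """Edge label lambda_k: drop all keys <= k."""
    return lambda m: frozenset(key for key in m if key > k)


def transfer(nodes, edges, inflow):
    """tf applied to `inflow`: least flow, then pushed over edges leaving `nodes`."""
    flow = {x: frozenset() for x in nodes}
    while True:
        nxt = {
            x: inflow.get(x, frozenset()).union(
                *(f(flow[y]) for (y, t), f in edges.items() if t == x)
            )
            for x in nodes
        }
        if nxt == flow:
            break
        flow = nxt
    return {(x, y): f(flow[x]) for (x, y), f in edges.items() if y not in nodes}


def sub_inflows(keys_at_l):
    """All inflows in' <= in, where in sends `keys_at_l` into node l."""
    keys = sorted(keys_at_l)
    for size in range(len(keys) + 1):
        for choice in combinations(keys, size):
            yield {"l": frozenset(choice)}


def same_transfer(nodes, edges1, edges2, keys_at_l):
    """Brute-force check of tf(h1) =_in tf(h2); missing outflow counts as empty."""
    for sub in sub_inflows(keys_at_l):
        o1, o2 = transfer(nodes, edges1, sub), transfer(nodes, edges2, sub)
        for e in set(o1) | set(o2):
            if o1.get(e, frozenset()) != o2.get(e, frozenset()):
                return False
    return True


nodes = ["l", "t", "r"]
inflow_keys = frozenset(range(4, 12))
h1_edges = {("l", "t"): lam(6), ("t", "r"): lam(NEG_INF), ("r", "out"): lam(8)}
h2_edges = {("l", "r"): lam(6), ("t", "r"): lam(NEG_INF), ("r", "out"): lam(8)}
# Hypothetical variant that filters one more key on the way out.
h3_edges = {("l", "r"): lam(6), ("t", "r"): lam(NEG_INF), ("r", "out"): lam(9)}

equivalent = same_transfer(nodes, h1_edges, h2_edges, inflow_keys)
different = same_transfer(nodes, h1_edges, h3_edges, inflow_keys)
print(equivalent, different)  # True False
```

Since both graphs have the same nodes and the same inflow, the first result says that unlinking t is a contextually equivalent update in this reduced model, whereas changing the label on the outgoing edge of r is not.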


To automate the frame rule for a command c and a precondition P, we need to 
identify a decomposition $P = P_1 * F$ so as to infer $\{P_1\}\, c\, \{P_2\}$ and then apply the 
frame rule to derive $\{P\}\, c\, \{Q\}$ for the postcondition $Q = P_2 * F$. This is closely related 
to the frame inference problem [4]. When a command modifies a flow graph $h_1$ to $h_2$, 
our goal is to identify a (hopefully small) set of nodes Y in $h_1$ that are affected by this 
update, the flow footprint. That is, Y captures the difference between the flow graphs 
before and after the update and the complement of Y defines the frame. To make this 
formal, we need the restriction of flow graphs to subsets of nodes, which then gives us 
a notion of flow graph decomposition. Towards this, consider h and $Y \subseteq \mathbb{N}$. We define 


\[ h|_Y \triangleq \big(h.X \cap Y,\ h.E|_{(h.X \cap Y) \times \mathbb{N}},\ in\big) \]

such that the inflow in satisfies $in(x, y) = h.in(x, y)$ for all $x \in \mathbb{N} \setminus h.X$, $y \in h.X \cap Y$ 
and $in(x, y) \triangleq h.E_{(x,y)}(h.flow(x))$ for all $x \in h.X \setminus Y$, $y \in h.X \cap Y$. 


Definition 2. Consider $h_1$ and $h_2$ with $X \triangleq h_1.X = h_2.X$ and $h_1.in = h_2.in$. A 
flow footprint for the difference between $h_1$ and $h_2$ is a subset of nodes $Y \subseteq X$ so that 
$h_1|_Y =_{ctx} h_2|_Y$ and $h_1|_{X \setminus Y} = h_2|_{X \setminus Y}$. The set of all such footprints is $FFP(h_1, h_2)$. 


Flow graphs over different sets of nodes or inflows never have a flow footprint. The 
former requirement merely simplifies the presentation. To that end, we assume that all 
nodes that will be allocated during program execution are already present in the initial 
flow graph. This assumption can be lifted. The latter requirement is motivated by the 
fact that the global inflow is part of the specification as noted earlier in Example 1. 

Before we proceed with the problem of how to compute flow footprints, we high- 
light some of their properties. 


Lemma 3 (Footprint Monotonicity). If $Z \in FFP(h_1, h_2)$ and $Z \subseteq Y \subseteq h_1.X$, then 
$Y \in FFP(h_1, h_2)$. 


A consequence of monotonicity is the existence of a canonical flow footprint: if 
there is a flow footprint at all, then the set of all nodes will work as a footprint. Of 
course this canonical footprint is undesirably large. It corresponds to the case where 
one reasons about flow graph updates globally, forgoing the application of the frame 
rule. Unfortunately, an inclusion-minimal flow footprint does not exist. 


Proposition 1 (Canonical Footprints). We have: $FFP(h_1, h_2) \neq \emptyset$ if and only if 
$h_1.X \in FFP(h_1, h_2)$. There is no inclusion-minimal flow footprint; in particular, the 
set $FFP(h_1, h_2)$ is not closed under intersection. 


The proof of monotonicity requires a better understanding of the restriction opera- 
tor, as provided by the following lemma. 


Lemma 4 (Restriction). Consider h and $Y, Z \subseteq \mathbb{N}$. Then (i) $h|_Y.flow = h.flow|_Y$, 
(ii) $h|_Y \mathrel{\#} h|_{X \setminus Y}$ and $h|_Y * h|_{X \setminus Y} = h$, and (iii) $(h|_Y)|_Z = h|_{Y \cap Z}$. 

Since flow footprints are defined via restriction, the lemma also shows that flow 
footprints are well-behaved. For example, the restriction to the footprint Y does not 
change the flow of a node $y \in Y$ nor that of a node $x \in h.X \setminus Y$. More formally, this 
means $h|_Y.flow(y) = h.flow(y)$ and $h|_{X \setminus Y}.flow(x) = h.flow(x)$, by Lemma 4(i). 

For our development, it will be convenient to have a more operational formulation 
of the transfer function. Towards this, we understand the flow graph as a function that 
takes an inflow as a parameter and yields a transformer of flow approximants: 


\[ h : ((\mathbb{N} \setminus X) \times X \to M) \to (X \to M) \to X \to M \quad\text{defined by}\quad h[in](\sigma)(x) = in_x + rhs_x(\sigma). \]


Recall $in_x = \sum_{y \in \mathbb{N} \setminus X} in(y, x)$ and $rhs_x(\sigma) = \sum_{y \in X} E_{(y,x)}(\sigma(y))$. The least fixed 
point of $h[in]$ is $\bigsqcup_{i \in \mathbb{N}} h[in]^i(\bot)$ with $h^0 = id_{X \to M}$ and $h^{i+1} = h^i \circ h$, by Kleene's 
theorem. Define $out : (X \to M) \to X \times (\mathbb{N} \setminus X) \to M$ by $out(\sigma)(y, z) = E_{(y,z)}(\sigma(y))$. 
This yields the following characterization of transfer functions and flows. 

Lemma 5 (Transfer). For all flow graphs h we have (i) $tf(h) = out \circ (lfp.h[-])$ and 
(ii) $lfp.h[h.in] = h.flow$. 


We present an algorithm for computing a footprint for the difference between two given 
flow graphs. We proceed in two steps. We first give a high-level description of the 
algorithm that ignores computability problems. In a second step, we show how to solve 
the computability problems. Throughout the development, we will assume to have flow 
graphs $h_1$ and $h_2$ over the same nodes $X \triangleq h_1.X = h_2.X$ and with the same inflow 
$h_1.in = h_2.in$. If this assumption fails, a flow footprint does not exist by definition. 


4.1 Algorithm 


We compute the flow footprint as a fixed point. We start with the footprint candidate 
Z consisting of the nodes whose outgoing edges differ in $h_1$ and $h_2$. Then, we iteratively 
add the nodes whose outflow leaving the current footprint candidate Z differs in 
$h_1|_Z$ and $h_2|_Z$. That the outflow differs means that the transfer functions $tf(h_1|_Z)$ and 
$tf(h_2|_Z)$ differ and thus the candidate Z is not a footprint. In turn, if all outflows match, 
the transfer functions coincide and Z is a footprint as desired. 

Technically, we compute the fixed point over the powerset lattice of nodes endowed 
with a distinguished top element: $(\mathcal{P}(X)^\top, \subseteq)$ with $\mathcal{P}(X)^\top \triangleq \mathcal{P}(X) \uplus \{\top\}$. Element 
$\top$ indicates a failure of the footprint computation. This may arise if the footprint is not 
covered by X, i.e., extends beyond the flow graphs $h_1$, $h_2$. 

Our fixed point computation starts from $Z = odif_{h_1,h_2} \subseteq X$ as defined by 


\[ odif_{h_1,h_2} \triangleq \{\, x \in X \mid \exists z \in \mathbb{N}.\ h_1.E_{(x,z)} \neq h_2.E_{(x,z)} \,\}. \]


The fixed point then proceeds to extend Z as long as the transfer functions associated 
with $h_1|_Z$ and $h_2|_Z$ do not match. To define the extension, we let the transfer failure of 
$Z \subseteq X$ be the successor nodes of Z that may receive different outflow from $h_1$ and $h_2$: 


\[ tfail_{h_1,h_2}(Z) \triangleq \{\, x \in \mathbb{N} \setminus Z \mid \exists\, in \le h_1|_Z.in\ \ \exists z \in Z.\ [tf(h_1|_Z)(in)](z, x) \neq [tf(h_2|_Z)(in)](z, x) \,\}. \]

This set is the reason why the current footprint candidate Z is not a footprint, that is, 
$Z \notin FFP(h_1, h_2)$. Extending Z with the transfer failure yields a new candidate. We 
check that the new candidate is covered by X (i.e., does not include nodes outside of 
$h_1$, $h_2$). If the check fails, the new candidate is $\top$ to indicate that no footprint could 
be computed. The following definition makes the extension procedure precise. 


Definition 3. The function $ext_{h_1,h_2} : \mathcal{P}(X)^\top \to \mathcal{P}(X)^\top$ is defined by 

\[ ext_{h_1,h_2}(Z) \triangleq\ tfail_{h_1,h_2}(Z) \not\subseteq X\ ?\ \top\ :\ Z \cup odif_{h_1,h_2} \cup tfail_{h_1,h_2}(Z). \]


Iteratively extending the candidate Z with the transfer failure eventually produces a 
footprint for the difference of $h_1$ and $h_2$, or fails with $\top$. The approach is sound. 


Theorem 2 (Soundness). Let $F = lfp.ext_{h_1,h_2}$. If $F \neq \top$, then $F \in FFP(h_1, h_2)$. 


Figure 3. Computing a footprint for the difference of h and h' iterates through the sets 
$Z_0 \triangleq \{r\}$, $Z_1 \triangleq \{r, u\}$, and $Z_2 \triangleq \{r, u, v\}$. The latter is the least fixed point of 
$ext_{h,h'}$ and a footprint as desired, $Z_2 \in FFP(h, h')$. 


Example 2. For an illustration consider Fig. 3. There, we apply the fixed point computation 
to find a footprint for the difference of h and h'. As alluded to in Sect. 1, h' is the 
result of inserting into h a new edge between nodes r and u labeled with $\lambda_{id}$. 

The fixed point computation starts from $Z_0 \triangleq \{r\} = odif_{h,h'}$ as it is the only 
node whose outgoing edges have changed. Next, we compute $tfail_{h,h'}(Z_0)$. This yields 
{u} because u receives 0 from $Z_0$ in h but 1 in h' due to the new edge. The outflow 
from $Z_0$ to the remaining nodes coincides in h and h'. Hence, the extension of $Z_0$ 
with the transfer failure yields $Z_1 = ext_{h,h'}(Z_0) = \{u, r\}$. Similarly, we compute 
$tfail_{h,h'}(Z_1)$ and obtain $Z_2 = ext_{h,h'}(Z_1) = \{r, u, v\}$. Since v has no outgoing edges, 
$Z_2$ is the least fixed point of $ext_{h,h'}$. Because $Z_2$ is a subset of the nodes of h and h', it 
is a footprint, $Z_2 \in FFP(h, h')$. 
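The iteration of Example 2 can be mimicked in a few lines of Python. This is a rough sketch under several assumptions that are not part of the paper: the monoid is the natural numbers with addition, the graph is a reduced version of the example (nodes r, u, v only, with the contribution of the remaining nodes replaced by an external inflow of 1 into u), edge functions are compared by object identity, and the quantification over smaller inflows is done by exhaustive enumeration. All names (`FlowGraph`, `restrict`, `tfail`, `footprint`, `TOP`) are hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, FrozenSet, Tuple

Node = str
Edge = Tuple[Node, Node]
TOP = "TOP"  # distinguished failure element


@dataclass
class FlowGraph:
    nodes: FrozenSet[Node]
    edges: Dict[Edge, Callable[[int], int]]  # source inside `nodes`
    inflow: Dict[Edge, int]                  # source outside `nodes`

    def flow(self, max_rounds: int = 1000) -> Dict[Node, int]:
        cur = {x: 0 for x in self.nodes}
        for _ in range(max_rounds):
            nxt = {
                x: sum(m for (_s, t), m in self.inflow.items() if t == x)
                + sum(f(cur[y]) for (y, t), f in self.edges.items() if t == x)
                for x in self.nodes
            }
            if nxt == cur:
                return cur
            cur = nxt
        raise RuntimeError("no fixed point within max_rounds")

    def outflow(self) -> Dict[Edge, int]:
        fl = self.flow()
        return {e: f(fl[e[0]]) for e, f in self.edges.items()
                if e[1] not in self.nodes}


def restrict(h: FlowGraph, ys: FrozenSet[Node]) -> FlowGraph:
    """h|_Y: keep nodes in Y; flow entering Y from h.X minus Y becomes inflow."""
    keep, fl = h.nodes & ys, h.flow()
    inflow = {e: m for e, m in h.inflow.items() if e[1] in keep}
    for (x, y), f in h.edges.items():
        if x not in keep and y in keep:
            inflow[(x, y)] = f(fl[x])
    edges = {e: f for e, f in h.edges.items() if e[0] in keep}
    return FlowGraph(keep, edges, inflow)


def tf(h: FlowGraph, inflow: Dict[Edge, int]) -> Dict[Edge, int]:
    return FlowGraph(h.nodes, h.edges, inflow).outflow()


def odif(h1: FlowGraph, h2: FlowGraph) -> FrozenSet[Node]:
    es = set(h1.edges) | set(h2.edges)
    return frozenset(x for (x, z) in es
                     if h1.edges.get((x, z)) is not h2.edges.get((x, z)))


def tfail(h1: FlowGraph, h2: FlowGraph, z: FrozenSet[Node]) -> FrozenSet[Node]:
    a, b = restrict(h1, z), restrict(h2, z)
    keys, failed = list(a.inflow), set()
    # Enumerate every inflow that is pointwise below a.inflow.
    for values in product(*(range(a.inflow[k] + 1) for k in keys)):
        sub = dict(zip(keys, values))
        out_a, out_b = tf(a, sub), tf(b, sub)
        for e in set(out_a) | set(out_b):
            if out_a.get(e, 0) != out_b.get(e, 0):
                failed.add(e[1])
    return frozenset(failed)


def footprint(h1: FlowGraph, h2: FlowGraph):
    z: FrozenSet[Node] = frozenset()
    while True:
        fail = tfail(h1, h2, z)
        if not fail <= h1.nodes:
            return TOP
        nxt = z | odif(h1, h2) | fail
        if nxt == z:
            return z
        z = nxt


ident = lambda m: m
inflow = {("ext", "r"): 1, ("ext", "u"): 1}
h_old = FlowGraph(frozenset("ruv"), {("u", "v"): ident}, inflow)
h_new = FlowGraph(frozenset("ruv"), {("r", "u"): ident, ("u", "v"): ident}, inflow)

result = footprint(h_old, h_new)
print(sorted(result))  # ['r', 'u', 'v']
```

The candidates visited are the empty set, {r}, {r, u}, and {r, u, v}, matching the sequence described in the example once the starting set {r} is reached.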


To obtain Theorem 2, we have to prove that the fixed point $F \triangleq lfp.ext_{h_1,h_2}$ is 
indeed a footprint if $F \neq \top$. That is, we have to establish the following two properties 
according to Definition 2: (i) $h_1|_F =_{ctx} h_2|_F$ and (ii) $h_1|_{X \setminus F} = h_2|_{X \setminus F}$. 

To see the latter one, note that the graph structures (the nodes and edges) of $h_1|_{X \setminus F}$ 
and $h_2|_{X \setminus F}$ coincide because $odif_{h_1,h_2} \subseteq F$. The inflows coincide as well because 
they are, intuitively, comprised of the flow graph's overall inflow $h_1.in = h_2.in$ and the 
outflow of the footprint, which is equal in both flow graphs due to $h_1|_F =_{ctx} h_2|_F$. 

The interesting part of the soundness proof is to establish property (i), the contextual 
equivalence $h_1|_F =_{ctx} h_2|_F$. Since F is a fixed point of $ext_{h_1,h_2}$, we know that 
$tfail_{h_1,h_2}(F) = \emptyset$ and thus the transfer functions of $h_1|_F$ and $h_2|_F$ coincide. Hence, 
it suffices to establish $h_1|_F.in = h_2|_F.in$ to obtain the desired contextual equivalence, 
Definition 1. This key step in the proof is obtained with the help of the following lemma. 


Lemma 6. Let $odif_{h_1,h_2} \subseteq F \subseteq X$ with $tfail_{h_1,h_2}(F) = \emptyset$. Then $h_1|_F.in = h_2|_F.in$. 


To establish the lemma one has to show that the inflow into F from the non-footprint 
part $Y \triangleq X \setminus F$ coincides in $h_1$ and $h_2$. The challenge is a cyclic dependency in the flow: 
the inflow from Y depends on the outflow of F, which depends on the inflow from Y. 
To tackle this, we rephrase the flow equation for $h_i$ as a pairing of the two separate flow 
equations for $h_i|_F$ and $h_i|_Y$, for $i \in \{1, 2\}$. Intuitively, the pairings compute the flow 
locally in $h_i|_F$ and $h_i|_Y$ for a fixed inflow (initially $h_i.in$). Then, the inflow to $h_i|_F$ 
is updated to the inflow from outside $h_i$ and the inflow from $h_i|_Y$, and similarly for 
the inflow to $h_i|_Y$. This is repeated until a fixed point is reached. Technically, we rely on 
Bekić's Lemma [1] to compute the pairings. Then, we observe $tf(h_1|_F) = tf(h_2|_F)$ 
because $tfail_{h_1,h_2}(F) = \emptyset$ as well as $tf(h_1|_Y) = tf(h_2|_Y)$ because $odif_{h_1,h_2} \subseteq F$. 
Roughly, this means that the flow pairings for $h_1$ and $h_2$ must coincide as the individual 
parts propagate the same values. Put differently, the updated inflow for $h_1|_F$ and $h_2|_F$ 
as well as $h_1|_Y$ and $h_2|_Y$ coincide in each iteration. Overall, we get $h_1|_F.in = h_2|_F.in$. 

Figure 4. Counterexample to completeness using the monoid $(\mathbb{N} \cup \{\infty\}, \max, 0)$. While 
the set {x, y, z, u} is a footprint for the difference between flow graphs $h_1$ and $h_2$, our 
fixed point will produce the candidates {x} and Z = {x, y, z} and then fail with $\top$. 

Our computation of a flow footprint is forward: it starts from the nodes where the 
flow graphs differ and follows the edges. It may therefore fail if predecessor nodes of 
an iterate Z need to be considered to determine a flow footprint. For an example refer to 
Fig. 4. Using the monoid $(\mathbb{N} \cup \{\infty\}, \max, 0)$, it is easy to see that the set {x, y, z, u} is a 
footprint for the difference between $h_1$ and $h_2$. Our fixed point, however, will start with 
{x} and extend this to Z = {x, y, z}. Let v be the node outside the flow graphs that y 
is pointing to. Then, the next transfer failure is $tfail_{h_1,h_2}(Z) = \{v\}$ because for in < k 
the outflow of y to v differs in $h_1|_Z$ and $h_2|_Z$. Our approach fails to compute a footprint. 


Fact 3 (Incompleteness). There are flow graphs $h_1$ and $h_2$ for which our algorithm is 
not able to determine a flow footprint although one exists. 


4.2 Comparing Transfer Functions 


When implementing the above fixed point computation, the challenge is to prove the 
equivalence between given transfer functions in order to obtain the transfer failure: 
$[tf(h_1|_Z)(-)](-, x) = [tf(h_2|_Z)(-)](-, x)$? Already the comparison of two functions 
is known to be difficult to do algorithmically. What adds to the problem is that transfer 
functions are defined as least fixed points, meaning we do not have a closed-form 
representation of the functions to compare. 

Our approach is to impose additional requirements on the set of edge functions. The
requirements are met in all our experiments, and so do not limit the applicability
of our approach. We show that if the edge functions are not only continuous
but also distributive, then the transfer functions can be understood in terms of paths
through the underlying flow graphs. If the edge functions are additionally decreasing
and the underlying monoid’s addition is idempotent, then acyclic paths are sufficient.
Neither result holds for merely continuous edge functions.

Distributivity. Our first additional assumption is that the edge functions $f : M \to M$
are not only continuous, but also distributive in that $f(m + n) = f(m) + f(n)$ for all
$m, n \in M$ and $f(0) = 0$. We use DistFun(M) to refer to the set of all continuous and
distributive functions over M. The properties formulated in Lemma 1 carry over.
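
The distributivity condition can be tested on sample values, as in the following minimal sketch. It assumes the monoid $(\mathbb{N} \cup \{\infty\}, \max, 0)$ used elsewhere in the paper; the edge functions `cap`, `successor`, and `dip` are hypothetical illustrations and not taken from the paper. A check on finitely many samples can only refute distributivity, it does not prove it.

```python
from itertools import product

INF = float("inf")

def add(m, n):
    """Monoid addition of (N ∪ {∞}, max, 0)."""
    return max(m, n)

def cap(c):
    """Hypothetical edge function m -> min(m, c)."""
    return lambda m: min(m, c)

def successor(m):
    """Hypothetical edge function that violates f(0) = 0."""
    return m + 1

def dip(m):
    """Hypothetical non-monotone function: identity below 2, zero from 2 on."""
    return m if m < 2 else 0

def is_distributive(f, samples):
    """Check f(0) = 0 and f(m + n) = f(m) + f(n) on all sample pairs."""
    if f(0) != 0:
        return False
    return all(f(add(m, n)) == add(f(m), f(n))
               for m, n in product(samples, repeat=2))

SAMPLES = [0, 1, 2, 5, 7, INF]

print(is_distributive(cap(3), SAMPLES))     # passes on the samples
print(is_distributive(successor, SAMPLES))  # fails: f(0) != 0
print(is_distributive(dip, SAMPLES))        # fails: dip(max(1, 2)) != max(dip(1), dip(2))
```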

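For continuous and distributive transfer functions, we can understand $h[in]^i$ in terms
of the paths through $h[in]$ of length $i$. For example, $i = 3$ yields

\[
\begin{aligned}
h[in]^3(\bot)(z) &= in_z + \sum_{y \in X} E_{(y,z)}\Bigl( in_y + \sum_{x \in X} E_{(x,y)}\bigl( in_x + \sum_{u \in X} E_{(u,x)}(\bot(u)) \bigr) \Bigr) \\
&= in_z + \sum_{y \in X} E_{(y,z)}(in_y) + \sum_{y \in X} \sum_{x \in X} E_{(y,z)}\bigl(E_{(x,y)}(in_x)\bigr).
\end{aligned}
\]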


The first equality is by definition, the second is where distributivity comes in. In particular,
$\bot(u) = 0$ and so $E_{(y,z)}(E_{(x,y)}(E_{(u,x)}(\bot(u)))) = 0$. The last term shows that
we forward the inflow given at a node $x$ to an intermediary node $y$ and from there to
the node $z$ of interest. For higher powers of $h[in]$, we take longer paths. For $h[in]^*$, we
thus obtain the sum over all nodes $x$ and all paths from $x$ to $z$ through the flow graph.
We need some definitions to make this precise. 

A path p through flow graph h is a finite, non-empty sequence of nodes all of which 
belong to the flow graph except the last which lies outside: 


\[ p = x_0 \cdot \ldots \cdot x_n \cdot z \in X^* \cdot (N \setminus X) \]


where $\cdot$ denotes path concatenation. We use $\mathit{first}(p) = x_0$ resp. $\mathit{last}(p) = x_n$ to extract
the first resp. last node from within the flow graph $h$. By $\mathit{Paths}(h, x, y, z)$ we denote
the set of all paths through flow graph $h$ that start in node $\mathit{first}(p) = x$ and leave $h$
from node $\mathit{last}(p) = y$ to move to $z \in N \setminus X$. Given a set of nodes $X' \subseteq X$, we use
$\mathit{Paths}(h, X', y, z)$ for the union over all $x \in X'$ of the sets $\mathit{Paths}(h, x, y, z)$. The path
induces the function $E_p : M \to M$ that composes the edge functions along the path:


\[ E_z = \mathrm{id} \qquad E_{x \cdot p} = E_p \circ E_{(x,\, \mathit{first}(p))} \]


Together with Lemma 5, the above analysis yields the first closed-form representation 
of a flow graph’s transfer function, which so far has involved a fixed point computation. 


Theorem 4 (Closed-Form Representation). If $h$ is labeled over DistFun(M), then:
\[ [\mathit{tf}(h)(in)](y, z) = \sum_{x \in X} \; \sum_{p \in \mathit{Paths}(h, x, y, z)} E_p(in_x). \]


Theorem 4 pushes the fixed point computation of transfer functions into the sets 
Paths(h, x,y,z) which are themselves defined inductively and potentially infinite. In 
the following, we alleviate this problem without requiring acyclicity of the flow graph. 
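
To illustrate the closed-form representation, the following sketch compares the sum over paths with a plain fixed point iteration on a small acyclic flow graph. It is only an illustration under stated assumptions: the monoid is $(\mathbb{N} \cup \{\infty\}, \max, 0)$, the graph, the edge functions `cap(c)`, and all function names are hypothetical, and the path enumeration terminates only because the example graph is acyclic.

```python
def msum(values):
    """Monoid sum in (N ∪ {∞}, max, 0); the empty sum is the neutral element 0."""
    return max(values, default=0)

def cap(c):
    """Hypothetical distributive edge function m -> min(m, c)."""
    return lambda m: min(m, c)

# Hypothetical acyclic flow graph over X = {x, y}; the node v lies outside X.
NODES = ["x", "y"]
EDGES = {("x", "y"): cap(5), ("y", "v"): cap(3), ("x", "v"): cap(1)}

def paths(edges, nodes, start, exit_node, target):
    """Yield every path start ... exit_node target (acyclic graphs only)."""
    if start == exit_node and (exit_node, target) in edges:
        yield [start, target]
    for src, dst in edges:
        if src == start and dst in nodes:
            for rest in paths(edges, nodes, dst, exit_node, target):
                yield [start] + rest

def path_function(edges, path, m):
    """Apply E_p to m: compose the edge functions along the path, first edge first."""
    for src, dst in zip(path, path[1:]):
        m = edges[(src, dst)](m)
    return m

def tf_closed_form(edges, nodes, inflow, y, z):
    """Sum of E_p(in_x) over all x and all paths from x leaving via (y, z)."""
    return msum(path_function(edges, p, inflow[x])
                for x in nodes
                for p in paths(edges, nodes, x, y, z))

def tf_fixed_point(edges, nodes, inflow, y, z):
    """Iterate the flow equation from the bottom flow, then read the outflow on (y, z)."""
    flow = {n: 0 for n in nodes}
    while True:
        new = {n: msum([inflow[n]] + [f(flow[s])
                                      for (s, d), f in edges.items() if d == n])
               for n in nodes}
        if new == flow:
            return edges[(y, z)](flow[y])
        flow = new

INFLOW = {"x": 4, "y": 2}
print(tf_closed_form(EDGES, NODES, INFLOW, "y", "v"),
      tf_fixed_point(EDGES, NODES, INFLOW, "y", "v"))
```

Both computations agree on this example; on cyclic graphs the set of paths is infinite and the enumeration above would not terminate.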


Idempotence. Our second assumption is that addition in the monoid is idempotent,
meaning $m + m = m$ for all $m \in M$. Idempotence ensures the addition degenerates to
a join for comparable elements: $m + n = m \sqcup n = n$ for all $m \le n \in M$. Unless stated
otherwise, we hereafter assume an idempotent addition.

With Theorem 4, it remains to compare sums over paths. With idempotence, we 
show that we can further reduce the problem and reason over single paths rather than 
sums. We show that every path in $h_1$ can be replaced by a set of paths in $h_2$, and vice
versa. Even more, we only have to consider the paths from nodes where the edges
changed. The precise formulation of the path replacement condition is the following.


Definition 4. The path replacement condition for flow graphs $h_1$ by $h_2$ over the same
set of nodes $X$ and labeled by DistDecFun(M) requires that for every $x \in \mathit{odif}_{h_1,h_2}$,
for every $y \in X$, and for every $z \in N \setminus X$ we have


\[ \forall p \in \mathit{Paths}(h_1, x, y, z)\;\; \exists P \subseteq \mathit{Paths}(h_2, x, y, z).\;\; E_p \le E_P := \sum_{q \in P} E_q . \]


Example 3. For the flow graphs $h_1$ and $h_2$ from Fig. 4, we have path replacement of
$h_1$ by $h_2$, and vice versa. To see this, consider the path $p = x \cdot z \cdot u \cdot y \cdot v$ in $h_1$ and
$q \triangleq x \cdot y \cdot v$ in $h_2$, where $v$ is the node outside of $h_1, h_2$ that $y$ points to. Since all edges
are labeled with $\lambda_{id}$, we have $E_p = \lambda_{id} = E_q$. It is worth noting that, in this example,
we can ignore the cycles in $h_1$ and $h_2$. In a moment, we will introduce restrictions on
edge functions in order to avoid cycles in general.

Similarly, we have path replacement for the flow graphs from Fig. 2. To be precise,
$E_p = \lambda_{id} = E_q$ for the paths $p \triangleq l \cdot t \cdot r \cdot v$ in $h_1$ and $q \triangleq l \cdot r \cdot v$ in $h_2$.


The main result is that path replacement is sound and complete for proving equiva- 
lence of transfer functions. 


Theorem 5 (Path Replacement Principle). We have $\mathit{tf}(h_1) = \mathit{tf}(h_2)$ if and only if
path replacement of $h_1$ by $h_2$ and of $h_2$ by $h_1$ hold.


The theorem is remarkable in several respects. First, one would expect we have
to replace the paths from all nodes in $h_1$. Instead, we can focus on the nodes where
the outgoing edges changed. Second, one would expect the replacing paths $P$ start
from arbitrary nodes in $h_2$. Such a set of paths would yield a transfer function of type
$(Y \to M) \to M$. Instead, we can work with a function of type $M \to M$. Even more, we
can focus on paths starting in the same node as the path we intend to replace. Finally, the
paths we use for replacement come without any constraints, leaving room for heuristics.

The proof starts from a full path replacement condition of $h_1$ by $h_2$, both over $X$ and
labeled by DistFun(M). Full path replacement coincides with Definition 4 but draws $x$
from full $X$ rather than $x \in \mathit{odif}_{h_1,h_2}$. Full path replacement characterizes equivalence
of the transfer functions in a monoid with idempotent addition in the case of continuous
and distributive edge functions.


Lemma 7. Full path replacement of $h_1$ by $h_2$ and $h_2$ by $h_1$ hold iff $\mathit{tf}(h_1) = \mathit{tf}(h_2)$.


The result is a consequence of Theorem 4, which equates $\mathit{tf}(h_1)$ with the sum of the
$E_p$ for all paths $p \in \mathit{Paths}(h_1, x, y, z)$ for all $x \in X$. Full path replacement allows us to
sum over $E_P$ instead, for some $P \subseteq \mathit{Paths}(h_2, x, y, z)$. Over-approximating $P$ with all
paths $\mathit{Paths}(h_2, x, y, z)$, we obtain an upper bound for $\mathit{tf}(h_1)$. It is easy to see that the
resulting sum can be rewritten into the form of Theorem 4, yielding $\mathit{tf}(h_1) \le \mathit{tf}(h_2)$.
Analogously, we get $\mathit{tf}(h_1) \ge \mathit{tf}(h_2)$ and thus $\mathit{tf}(h_1) = \mathit{tf}(h_2)$ as required. The reverse
direction of the lemma is similar.

To conclude the proof of the path replacement principle in Theorem 5, we show that
full path replacement and (ordinary) path replacement of $h_1$ by $h_2$ coincide. To see this,
consider a path $p \in \mathit{Paths}(h_1, x, y, z)$ for any $x \in X$. The goal is to show $E_p \le E_P$ for
some $P \subseteq \mathit{Paths}(h_2, x, y, z)$. To that end, decompose the path into $p = p_1 \cdot p_2$ such that
$x' = \mathit{first}(p_2)$ is the first node in $p$ from $\mathit{odif}_{h_1,h_2}$. Ordinary path replacement yields
$Q \subseteq \mathit{Paths}(h_2, x', y, z)$ with $E_{p_2} \le E_Q$. Now, choose $P = \{ p_1 \cdot q \mid q \in Q \}$. Because
$p_1$ exists in $h_1$ and $h_2$ with the exact same edge labels, we obtain the desired $E_p \le E_P$.


Lemma 8. Full path replacement of $h_1$ by $h_2$ holds if and only if path replacement of
$h_1$ by $h_2$ holds.


Decreasingness. We assume that the edge functions $f : M \to M$ are not only continuous
and distributive, but also decreasing: $f(m) \le m$ for all $m \in M$. The assumption of
decreasing edge functions is justified by the fact that a program that traverses the flow
graph builds up information about the status of the structure, and smaller flow values
mean more information (as in classical data flow analysis). We use DistDecFun(M) to
refer to the set of all continuous, distributive, and decreasing transfer functions over M;
Lemma 1 carries over to this set. Addition in the monoid is still assumed idempotent.

If all edge functions are decreasing, every cycle in the flow graph is decreasing as 
well. The key observation is that, given an idempotent addition, cycles with decreasing 
edge functions can be avoided when forming sums over sets of paths. 


Lemma 9. Let $h$ be labeled over DistDecFun(M) and $p_1 \cdot p \cdot p_2 \in \mathit{Paths}(h, x, y, z)$
with $\mathit{last}(p) = \mathit{first}(p)$. Then $p_1 \cdot p_2 \in \mathit{Paths}(h, x, y, z)$ and $E_{p_1 \cdot p \cdot p_2} \le E_{p_1 \cdot p_2}$.
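
The inequality of the lemma can be unfolded into a short derivation. The following is an informal sketch, not the paper's proof: $g$, $c$, and $f$ are names introduced here for the composed edge functions of the prefix, of the cycle, and of the remaining suffix, the order on functions is assumed to be pointwise, and continuous functions are assumed to be monotone.

```latex
\begin{align*}
c(m) &\le m
  && \text{a composition of decreasing functions is decreasing} \\
E_{p_1 \cdot p \cdot p_2}(m) = f\bigl(c(g(m))\bigr) &\le f\bigl(g(m)\bigr) = E_{p_1 \cdot p_2}(m)
  && \text{monotonicity of } f \\
E_{p_1 \cdot p \cdot p_2}(m) + E_{p_1 \cdot p_2}(m) &= E_{p_1 \cdot p_2}(m)
  && \text{idempotent addition: } m' + n' = n' \text{ for } m' \le n'
\end{align*}
```

The last line is the reason why a path that runs through a cycle contributes nothing to a sum that already contains the path without the cycle.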


Call a path simple if it does not repeat a node and let SimplePaths(h, x, y, z) denote
the set of all simple paths through h from x to y and leaving the flow graph towards z.
Note that a finite graph only admits finitely many simple paths. 


Theorem 6 (Simple Paths). Assuming continuous, distributive, and decreasing edge 
functions, and assuming idempotent addition, Theorem 4 and Theorem 5 hold with every
occurrence of Paths(h, x, y, z) replaced by SimplePaths(h, x, y, z).
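
The following sketch illustrates the statement on a small cyclic flow graph: summing over simple paths gives the same outflow as iterating the flow equation to a fixed point. As before, this is an illustration under assumptions and not the krill implementation: the monoid is $(\mathbb{N} \cup \{\infty\}, \max, 0)$, and the graph, the decreasing edge functions `cap(c)`, and all names are hypothetical.

```python
def msum(values):
    """Idempotent monoid sum in (N ∪ {∞}, max, 0)."""
    return max(values, default=0)

def cap(c):
    """Hypothetical distributive and decreasing edge function m -> min(m, c)."""
    return lambda m: min(m, c)

# Hypothetical flow graph with a cycle x -> y -> x; the node v lies outside.
NODES = {"x", "y"}
EDGES = {("x", "y"): cap(5), ("y", "x"): cap(4), ("y", "v"): cap(3)}

def simple_paths(edges, nodes, start, exit_node, target, seen=()):
    """Yield all paths from start that leave via (exit_node, target) without repeating a node."""
    seen = seen + (start,)
    if start == exit_node and (exit_node, target) in edges:
        yield list(seen) + [target]
    for src, dst in edges:
        if src == start and dst in nodes and dst not in seen:
            yield from simple_paths(edges, nodes, dst, exit_node, target, seen)

def path_function(edges, path, m):
    """Apply E_p to m by composing the edge functions along the path."""
    for src, dst in zip(path, path[1:]):
        m = edges[(src, dst)](m)
    return m

def tf_simple_paths(edges, nodes, inflow, y, z):
    """Sum of E_p(in_x) over all x and all simple paths from x leaving via (y, z)."""
    return msum(path_function(edges, p, inflow[x])
                for x in nodes
                for p in simple_paths(edges, nodes, x, y, z))

def tf_fixed_point(edges, nodes, inflow, y, z):
    """Iterate the flow equation from the bottom flow until it stabilizes."""
    flow = {n: 0 for n in nodes}
    while True:
        new = {n: msum([inflow[n]] + [f(flow[s])
                                      for (s, d), f in edges.items() if d == n])
               for n in nodes}
        if new == flow:
            return edges[(y, z)](flow[y])
        flow = new

for inflow in ({"x": 7, "y": 2}, {"x": 0, "y": 9}):
    print(tf_simple_paths(EDGES, NODES, inflow, "y", "v"),
          tf_fixed_point(EDGES, NODES, inflow, "y", "v"))
```

The iteration terminates here because the flow values only grow and are bounded by the largest inflow; the simple-path sum needs no such argument since a finite graph has finitely many simple paths.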


In practice, path-counting flows, keyset flows, reachability flows, shortest-path flows, 
and priority inheritance flows are relevant [22—24, 27] and compatible with our theory. 


5 Evaluation 


We substantiate the practicality of our new approach by evaluating it on a real-world 
collection of flow graphs extracted from the literature. We explain how we obtained our 
benchmarks and how we implemented and evaluated our approach. 


Benchmark Suite. As alluded to in Sect. 1, the flow framework has been used to 
verify complex concurrent data structures. More specifically, it has been used for auto- 
mated proof construction by the plankton tool [26,27]. plankton performs an exhaus- 
tive proof search over a separation logic with support for flows—and further advanced 
features for establishing linearizability that do not matter for the present evaluation. 
In order to handle heap updates, plankton generates a footprint $h$ for the flow graph
$h_1 = h * h_{frame}$ of the current proof state (represented as an assertion in separation
logic). It then frames the non-footprint part $h_{frame}$ of the flow graph $h_1$ to compute the
post state $h'$ of the heap update locally for the footprint $h$. The result is the new flow
graph $h_2 = h' * h_{frame}$. We consider the pair $(h_1, h_2)$ a benchmark for our evaluation.

We adapt plankton to export the flow graph pairs for which a footprint is con- 
structed. This way, we obtain 1272 benchmarks from the heap updates occurring during 
proof construction for a collection of 10 concurrent set data structures. All flow graphs 
in this benchmark suite contain at most 4 nodes. 

Our benchmark suite is limited by the capabilities and restrictions of plankton. In 
particular, we inherit the confinement to concurrent search structures. This is due to 
the fact that plankton integrates support only for the keyset flow (cf. Example 1). Our 
evaluation will compute footprints with respect to this flow. 


Implementation. We implement the fixed point computation to find footprints for two
given flow graphs $h_1, h_2$ from Sect. 4 in a tool called krill [28]. It integrates three
methods for computing the transfer failure $\mathit{tfail}_{h_1,h_2}(Z)$ of a footprint candidate $Z$:

1. NAIVE: A naive method that computes the flow within the footprint Z. Following 
[24], we require acyclicity of flow graphs for this method to avoid solving a fixed 
point equation when computing the flow. 

2. NEW: Our new approach leveraging the path replacement condition (cf. Theorem 5) 
for simple paths (cf. Theorem 6). This method requires distributive and decreasing 
edge functions as well as idempotent addition in the underlying monoid. 

3. DIST: A variation of our new approach leveraging the closed-form representation 
(cf. Theorem 4). We require distributive edge functions and acyclicity of the flow 
graphs to avoid an unbounded sum over all paths in the closed-form representation. 

Our benchmark suite satisfies the requirements for all three methods. The NAIVE and 
DIST methods include a (sufficient) check to ensure acyclicity in the updated flow graph 
to guarantee soundness of the resulting footprint. 

All three methods encode the necessary equivalence checks among transfer func- 
tions as SMT formulas which are then discharged using the off-the-shelf SMT solver 
Z3 [31]. Our encodings use the theory of integers with quantifiers. The NAIVE method 
additionally uses free functions to encode sets of integers. 


Experiments. We ran krill on our benchmark suite and compared the runtime of the 
three different methods for computing the transfer failure. Our results are summarized 
in Fig. 5(left). For every search structure that we extracted benchmarks from, the figure 
lists: (i) the number #FG of flow graph pairs extracted, (ii) each method’s total runtime
for computing the footprints of all flow graph pairs, and (iii) the speedup of NEW over 
NAIVE in percent. The experiments were conducted on an Apple M1 Pro. 

Figure 5(left) shows that the runtime for all methods is roughly linear in the number 
of computed footprints. Moreover, the absolute time for computing footprints is small, 
making the approaches practical. The figure also shows that our NEW and DIST methods 
have a performance advantage over the NAIVE method. The NEW method is between 
22% and 39% faster than the NAIVE method. We believe that the difference is relatively 
small only because the acyclicity assumption avoids a potentially non-terminating fixed 
point computation. Avoiding this fixed point in the presence of cycles is a major advantage
that our NEW method has over the NAIVE and DIST methods. The performance
difference between DIST and NEW is negligible because the cost of the acyclicity check is negligible.

Structure                 #FG    NAIVE     DIST      NEW       Speedup
Fine set [13]             12     75ms      48ms      46ms      39%
Lazy set [12]             14     73ms      52ms      51ms      30%
ORVYY set [33]            20     106ms     76ms      74ms      30%
VY DCAS set [46]          19     109ms     74ms      73ms      33%
VY CAS set [46]           28     139ms     104ms     102ms     27%
Michael set [29]          225    1216ms    887ms     874ms     28%
Michael set (wait-free)   186    996ms     731ms     721ms     27%
Harris set [11]           352    2242ms    1490ms    1443ms    36%
Harris set (wait-free)    296    1859ms    1242ms    1205ms    35%
FEMRS tree [10]           120    519ms     409ms     407ms     22%
Total                     1272   7335ms    5114ms    4996ms    32%

[Right panel: plot with series NAIVE, DIST, and NEW; not reproduced.]


Figure 5. Experimental results averaged over 1000 repeated runs, conducted on an Apple
M1 Pro. (left) Total runtime for computing footprints for flow graphs occurring during
automated proof construction for highly concurrent set data structures. The speedup
gives the relative performance improvement of NEW over NAIVE. (right) Average run- 
time for computing a single footprint, partitioned by footprint size (T indicates failure). 


We also broke down the runtimes of our benchmarks by the size of the resulting
footprint. Figure 5(right) gives the average runtime and standard deviation for comput- 
ing a single footprint, broken down by footprint size. If no footprint could be found, its 
size is listed as T. These failed footprint constructions are consistent with plankton’s 
method and would not lead to verification failure. 
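
The speedup column of Fig. 5 (left) can be reproduced from the runtimes with a few lines of arithmetic. The sketch below assumes that the speedup is the relative runtime reduction (NAIVE − NEW) / NAIVE, which is an interpretation of "relative performance improvement" and not stated as a formula in the text; the function name is hypothetical. Because the table reports rounded milliseconds, individual rows may deviate by one percentage point.

```python
# (NAIVE, NEW) total runtimes in milliseconds, copied from Fig. 5 (left).
ROWS = {
    "Fine set": (75, 46),
    "FEMRS tree": (519, 407),
    "Total": (7335, 4996),
}

def speedup_percent(naive_ms, new_ms):
    """Relative runtime reduction of NEW over NAIVE, rounded to whole percent."""
    return round(100 * (naive_ms - new_ms) / naive_ms)

for name, (naive_ms, new_ms) in ROWS.items():
    print(f"{name}: {speedup_percent(naive_ms, new_ms)}%")
```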


6 Related Work 


Two alternative meta theories for the flow framework have been proposed in prior 
work [23, 24]. Like in our setup, the original flow framework [23] demands that the 
flow domain is an ω-cpo to obtain a least fixed point semantics. However, it proposes a
different flow graph composition that leads to a notion of contextual equivalence relying 
on inflow equivalence classes. This complicates proof automation. In addition, the flow 
domain is assumed to be a semiring and edge functions are restricted to multiplication 
with a constant. This limits expressivity. 

As discussed in Sect. 1, the revised flow framework proposed in [24] requires that 
the flow monoid is cancellative but not an ω-cpo. This means that uniqueness of flows is
not guaranteed per se. Instead, uniqueness is obtained by imposing additional conditions 
on the edge functions. However, these conditions are more restrictive than those im- 
posed in our framework. The capacity of a flow graph introduced in [24] closely relates 
to our notion of transfer function. A closed-form representation based on sums over 
paths is used to check equivalence of capacities. However, this reasoning is restricted 
to acyclic graphs. Also, [24] provides no algorithm for computing flow footprints. 

In a sense, our work strikes a balance between the two prior meta theories by guar- 
anteeing unique flows without sacrificing expressivity and, at the same time, enabling 
better proof automation. That said, we believe that the framework proposed in [24] re- 
mains of independent interest, in particular if the application does not require unique 
flows (i.e., does not impose lower bounds on flows that may trivially hold in the presence
of vanishing flows). Cancellativity allows one to aggregate inflows and outflows
to unary functions, which can lead to smaller flow footprints (i.e., more local proofs). 


The benchmark suite for our evaluation is obtained from plankton [26,27], a tool for 
verifying concurrent search structures using keyset flows. When the program mutates 
the symbolic heap, plankton creates a flow graph for the mutated nodes plus all nodes 
with a distance of k or less from those nodes. This flow graph is considered to be the 
footprint and contextual equivalence is checked. The check is basically the same as 
for NAIVE. However, the paper does not present the meta theory for the underlying 
notion of flow graphs, nor does it provide any justification for the correctness of the 
implemented algorithms used to reason about flow graphs. 


Flow graphs form a separation algebra. Hence, the developed theory can be used 
in combination with any existing separation logic that is parametric in the underly- 
ing separation algebra such as [5, 7, 18, 27,41, 44]. Identifying footprints of updates 
relates to the frame inference problem in separation logic, which has been studied ex- 
tensively [4, 6, 15, 25, 35, 36, 42]. However, existing work focuses on frame inference 
for assertions that are expressed in terms of inductive predicates. These techniques are 
not well-suited for reasoning about programs manipulating general graphs, including 
overlayed structures, which are often used in practice and easily expressed using flows. 
A common approach to reason about general heap graphs in separation logic is to use 
iterated separating conjunction [14, 39,44, 47] to abstract the heap by a pure graph that 
does not depend on the program state. However, the verification of specifications that
rely on inductive properties of the pure graph then resorts back to classical first-order 
reasoning and is difficult to automate. An exception is [45] which uses SMT solvers to 
frame binary reachability relations in graphs that are described by iterated separating 
conjunctions. However, the technique is restricted to such reachability properties only. 


Unbounded footprints have been encountered early on when computing the post im- 
age for recursive predicates [8]. This has spawned interest in separation logic fragments 
for which the reasoning can be efficiently automated [2,3,9, 17,20,35,38]. A limitation 
that underlies all these works is an assumption of tree-regularity of the heap, in one way 
or another, which flows have been designed to overcome. In cases where the program 
(or ghost code) traverses the unbounded footprint (before or after the update), recent 
works [24,27] have found a way to reduce the reasoning to bounded footprint chunks. 


The definition of a flow closely resembles the classical formulation of a forward 
data flow analysis. The fact that the least fixed point of the flow equation for distributive 
edge functions can be characterized as a join over all paths in the flow graph mirrors dual 
results for greatest fixed points in data flow analysis [19,21]. In a similar vein, the notion 
of contextual equivalence of flow graphs relates to contextual program equivalence and 
fully abstract models in denotational semantics [16, 30, 37]. In fact, Bekić’s Lemma [1],
which we use in the proofs of Theorem 1 and Lemma 6, was originally motivated by the
study of such models. Flow graphs can serve as abstractions of programs (rather than 
just program states). We therefore believe that our results could also be of interest for 
developing incremental and compositional data flow analysis frameworks. 

Abstract. Automated reasoning is routinely used in the rigorous construction and
analysis of complex systems. Among different theories,
arithmetic stands out as one of the most frequently used and at the same 
time one of the most challenging in the presence of quantifiers and un- 
interpreted function symbols. First-order theorem provers perform very 
well on quantified problems due to the efficient superposition calculus, but 
support for arithmetic reasoning is limited to heuristic axioms. In this 
paper, we introduce the ALASCA calculus that lifts superposition reasoning 
to the linear arithmetic domain. We show that ALASCA is both sound 
and complete with respect to an axiomatisation of linear arithmetic. We 
implemented and evaluated ALASCA using the VAMPIRE theorem prover, 
solving many more challenging problems compared to state-of-the-art 
reasoners. 


Keywords: Automated Reasoning · Linear Arithmetic · SMT · Quantified
First-Order Logic · Theorem Proving


1 Introduction 


Automated reasoning is undergoing a rapid development thanks to its successful 
use, for example, in mathematical theory formalisation [15], formal verification [16] 
and web security [13]. The use of automated reasoning in these areas is mostly 
driven by the application of SMT solving for quantifier-free formulas [6, 12, 29]. 
However, there exist many use case scenarios, such as expressing arithmetic 
operations over memory allocation and financial transactions [1, 18, 20, 32], which 
require complex first-order quantification. SMT solvers handle quantifiers using 
heuristic instantiation in domain-specific model construction [10, 28, 30, 36]. 
While being incomplete in most cases, instantiation requires instances to be 
produced to perform reasoning, which can lead to an explosion in work required 
for quantifier-heavy problems. What is rather needed to address the above use 
cases is a reasoning approach able to handle both theories and complex applications 
of quantifiers. Our work tackles this challenge and designs a practical, low-cost 
methodology for proving first-order quantified linear arithmetic properties. 

S. Sankaranarayanan and N. Sharygina (Eds.) TACAS 2023, LNCS 13993, pp. 647-665, 2023.
https://doi.org/10.1007/978-3-031-30823-9_33


648 K. Korovin et al. 


The problem of combining quantifiers with theories, and especially with 
arithmetic, is recognised as a major challenge in both SMT and first-order proving 
communities. In this paper we focus on first-order, i.e. quantified, reasoning 
with linear arithmetic and uninterpreted functions. In [26], it is shown that the 
validity problem for first-order reasoning with linear arithmetic and uninterpreted 
functions is $\Pi^1_1$-complete even when quantifiers are restricted to non-theory sorts.
Therefore, there is no sound and complete calculus for this logic. 


Quantified Reasoning in Linear Arithmetic — Related Works. In practice, 
there are two classes of methods for first-order theory reasoning,
and in particular with linear real arithmetic. SMT solvers use instance-based 
methods, where they repeatedly generate ground, that is quantifier-free, instances 
of quantified formulas and use decision procedures to check satisfiability of the 
resulting set of ground formulas [10, 28, 36]. Superposition-based first-order 
theorem provers use saturation algorithms [14, 27, 37]. In essence, they start with
an initial set of clauses obtained by preprocessing the input formulas (initial 
search space) and repeatedly apply inference rules (such as superposition) to 
clauses in the search space, adding their (generally, non-ground) consequences to 
the search space. These two classes of methods are very different in nature and 
complement each other. 
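
The saturation loop described above can be sketched in a few lines. The example is deliberately simplified and makes several assumptions: it works on propositional clauses, uses binary resolution as the only inference rule in place of superposition, and omits clause selection, term orderings, and redundancy elimination beyond tautology deletion. It is not how Vampire or any other prover mentioned here is implemented; literals are encoded as non-zero integers and all names are hypothetical.

```python
def resolvents(c1, c2):
    """Yield all binary resolvents of two clauses (frozensets of integer literals)."""
    for lit in c1:
        if -lit in c2:
            yield frozenset((c1 - {lit}) | (c2 - {-lit}))

def saturate(clauses):
    """Return True if the empty clause is derivable, False once the set is saturated."""
    search_space = {frozenset(c) for c in clauses}
    while True:
        new = set()
        for c1 in search_space:
            for c2 in search_space:
                for r in resolvents(c1, c2):
                    if not r:
                        return True            # empty clause: the input is unsatisfiable
                    if not any(-lit in r for lit in r):
                        new.add(r)             # keep non-tautological consequences
        if new <= search_space:
            return False                       # nothing new: the search space is saturated
        search_space |= new

print(saturate([[1], [-1, 2], [-2]]))   # p, p -> q, not q
print(saturate([[1, 2], [-1, 2]]))      # satisfiable, saturates without a refutation
```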

The superposition calculus [4, 31] is a refutationally complete calculus for first- 
order logic with equality that is used by modern first-order provers, for example, 
Vampire [27], E [37], iProver [17] and Zipperposition [14]. There have been a 
number of practical extensions to this calculus for reasoning in first-order theories, 
in particular for linear arithmetic [9, 11, 24]. Superposition theorem provers have 
become efficient and powerful on theory reasoning after the introduction of the 
AVATAR architecture [33, 38], which allows generated ground clauses to be 
passed to SMT solvers. Yet, superposition theorem provers have a major source 
of inefficiency. To work with theories, one has to add theory axioms, for example 
the transitivity of inequality $\forall x \forall y \forall z\,(x < y \land y < z \to x < z)$. In clausal form,
this formula becomes $\neg x < y \lor \neg y < z \lor x < z$ where $\neg x < y$ can be resolved
against every clause in which an inequality literal s < t is selected. This, with 
other prolific theory axioms, results in a very significant growth of the search 
space. Note that SMT solvers do not use and do not need such theory axioms. 

A natural solution is to try to eliminate some theory axioms, but this is 
notoriously difficult both in theory and in practice. In [26], the LASCA calculus 
was proposed, which replaced several theory axioms of linear arithmetic, including 
transitivity of inequality, by a new inference rule inspired by Fourier-Motzkin 
elimination and some additional rules. LASCA was shown to be complete for the 
ground case. But, after 15 years, LASCA is still not implemented, due to its 
complexity and lack of clear treatment for the non-ground case. As we argue 
in Sect. 5, lifting LASCA to the non-ground setting is nearly impossible as a 
non-ground extension of the underlying ordering is missing in [26].
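
For readers unfamiliar with Fourier-Motzkin elimination, the following sketch shows the classical elimination step on ground linear constraints. It illustrates the procedure that inspired the inference rule and is not the LASCA or ALASCA rule itself. The encoding is an assumption made for this example: a constraint is a triple (coefficients, constant, strict) standing for "sum of coefficient times variable, plus constant, > 0" when strict and ">= 0" otherwise; coefficients are integers here but could be `fractions.Fraction` values.

```python
from itertools import product

def eliminate(constraints, var):
    """Classical Fourier-Motzkin step: eliminate `var` from the given constraints."""
    pos, neg, rest = [], [], []
    for coeffs, const, strict in constraints:
        c = coeffs.get(var, 0)
        bucket = pos if c > 0 else neg if c < 0 else rest
        bucket.append((coeffs, const, strict))
    # Combine every constraint with positive coefficient with every negative one.
    for (pc, pk, ps), (nc, nk, ns) in product(pos, neg):
        a, b = pc[var], -nc[var]               # both scaling factors are positive
        combined = {}
        for v in set(pc) | set(nc):
            if v == var:
                continue
            coeff = b * pc.get(v, 0) + a * nc.get(v, 0)
            if coeff != 0:
                combined[v] = coeff
        rest.append((combined, b * pk + a * nk, ps or ns))
    return rest

# Transitivity without an axiom: from x < y and y < z derive x < z.
x_lt_y = ({"y": 1, "x": -1}, 0, True)    # y - x > 0
y_lt_z = ({"z": 1, "y": -1}, 0, True)    # z - y > 0
print(eliminate([x_lt_y, y_lt_z], "y"))  # the single constraint z - x > 0
```

Eliminating `y` from the two premises yields the transitivity consequence directly, which is the effect that the transitivity axiom would otherwise achieve through two resolution steps.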


Lifting Lasca to Alasca— Our contributions. In this paper we introduce a 
new non-ground version of LASCA, which we call Abstracting LASCA (ALASCA). 
Our ALASCA calculus comes with new abstraction mechanisms (Sect. 4), inference 


ALASCA: Reasoning in Quantified Linear Arithmetic 649 


rules and orderings (Sect. 5), which all together are proved to yield a sound 
and complete approach with respect to a natural partial axiomatisation of linear 
arithmetic (Theorem 5)*. In a nutshell, we make ALASCA both work and scale 
by introducing (i) a novel variable elimination rule within saturation-based proof 
search (Fig. 3b); (ii) an analogue of unification with abstraction [34] needed for 
non-ground reasoning (Sect. 4); and (iii) a new non-ground ordering and powerful 
background theory for unification, which is not restricted to arithmetic but can be 
used with arbitrary theories (Sect. 5). As a result, ALASCA improves [26] by ground
modifications and lifting of LASCA in a finitary way, and complements [3, 40] with
variable elimination rules that are compatible with standard saturation algorithms.
We also demonstrate the practicality and efficiency of ALASCA (Sect. 6). To this 
end, we implemented ALASCA in Vampire and show that it solves overall more 
problems than existing theorem provers. 


2 Motivating Example 


Consider the following mathematical property: 


\[ \forall x, y.\,\bigl(f(2x, y) > 2x + y \;\lor\; f(x + 1, y) > x + 2y\bigr) \;\to\; \forall x \exists y.\, f(2, y) > x \qquad (1) \]


where f is an uninterpreted function. While property (1) holds, deriving its 
validity is hard for state-of-the-art reasoners: only veriT [2] can solve it. Despite 
its seeming simplicity, this problem requires non-trivial handling of quantifiers 
and arithmetic. Namely, one would need to unify (modulo theory) the terms 
2x and x + 1 (which can be done by instantiating x with 1) and then derive 
f(2,y) >2+yV f(2,y) > 1+ 2y. Further, one also needs to prove that f(2,y) is 
always greater than the minimum of 2 + y and 1 + 2y, for arbitrary y. 
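
The informal argument for the validity of (1) can be written out step by step. This is a sketch of the semantic reasoning only, not of the ALASCA proof; $b$ is a name introduced here for the universally quantified variable in the conclusion of (1).

```latex
\begin{align*}
&\text{instantiate } x := 1 \text{ (so } 2x = x + 1 = 2\text{):}
  && f(2, y) > 2 + y \;\lor\; f(2, y) > 1 + 2y \\
&\text{hence, for every } y\text{:}
  && f(2, y) > \min(2 + y,\; 1 + 2y) \\
&\text{for any } b \text{, choose } y \ge \max\bigl(b - 2,\; \tfrac{b - 1}{2}\bigr)\text{:}
  && \min(2 + y,\; 1 + 2y) \ge b \\
&\text{therefore:}
  && f(2, y) > b
\end{align*}
```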

Vampire with ALASCA finds a remarkably short proof as shown in Fig. 1. To 
prove (1) its negation is shown unsatisfiable by first negating and translating into 
clausal form (by using skolemization and normalisation, which shifts arithmetic 
terms to be compared to 0), as listed in lines 1-4. Next a lower bound for f (2, y) is 
established: In line 5, using our new inequality factoring (IF) rule with unification
with abstraction (see Fig. 3a), the constraint $2x \not\approx x + 1$ is introduced,
thereby establishing that if $2x \approx x + 1$ and $y + 2x \le 2y + x$, then $f(2x, y) > 2x + y$.
After further normalisation, the inequalities $sk \ge f(2, y)$ and $f(2x, y) > 2x + y$
are used to derive $sk > 2x + y$ in line 7, using the Fourier-Motzkin Elimination
rule (FM), while still keeping track of the constraint $2x \not\approx x + 1$. By applying the
Variable Elimination rule (VE) twice, the empty clause $\square$ is derived in line 10,
showing the unsatisfiability of the negation of (1).

The key steps in the proof (and the reason why it was found in a short time) 
are: (1) the use of the theory rules (FM), and (IF); (2) the use of the new variable 
elimination rule (VE), and finally, a consistent use of unification with abstraction. 
These rules give a significant reduction compared to the number of steps required 
using theory axioms. In particular, not using (FM) would require the use of 
transitivity and generation of several intermediate clauses. As well as shortening 


* Proofs and further details of our results can be found in [23].

 1. f(2x,y) > 2x + y ∨ f(x+1,y) > x + 2y                           Hypothesis
 2. ¬f(2,y) > sk                                                    Skolemized, Neg. Conj.
 3. f(2x,y) − 2x − y > 0 ∨ f(x+1,y) − x − 2y > 0                    Normalisation 1
 4. −f(2,y) + sk ≥ 0                                                Normalisation 2
 5. f(2x,y) − 2x − y > 0 ∨ y + 2x − 2y − x > 0 ∨ 2x ≉ x + 1         (IF) 3
 6. f(2x,y) − 2x − y > 0 ∨ x − y > 0 ∨ 0 ≉ x − 1                    Normalisation 5
 7. −2x − y + sk > 0 ∨ x − y > 0 ∨ 0 ≉ x − 1 ∨ 2x ≉ 2               (FM) 6,4
 8. −2x − y + sk > 0 ∨ x − y > 0 ∨ 0 ≉ x − 1                        Normalisation 7
 9. 0 ≉ x − 1                                                       (VE) 8
10. □                                                               (VE) 9


Fig. 1. A refutational proof using the calculus introduced in this paper. Variables x, y 
are implicitly universally quantified, and sk is an uninterpreted constant. 


the proof, we eliminate the fatal impact on proof search from generating a large 
number of irrelevant formulas from theory axioms.

Indeed, such short proofs are also found quickly. Similar to our previous example,
$\forall x, y.\,(f(g(x) + g(x), y) > 2x + y \lor f(2g(x), y) > x + 2y) \to \exists k.\forall x \exists z.\, f(2g(k), z) > x$
has a short proof of 7 steps, excluding CNF transformation and normalisation 
steps, found by Vampire with ALASCA. This proof was found in almost no time 
(only 37 clauses were generated) but cannot be solved by any other solver. This 
shows the power of the calculus. 


3 Background and Notation 


Multi-Sorted First-Order Logic. We assume familiarity with standard first-order 
logic with equality, with all standard boolean connectives and quantifiers in the 
language. We consider a multi-sorted first-order language, with sorts $\tau_{\mathbb{Q}}, \tau_1, \ldots, \tau_n$.
The sort $\tau_{\mathbb{Q}}$ is the sort of rationals, whereas $\tau_1, \ldots, \tau_n$ are uninterpreted sorts.
We write $\approx_\tau$ for the equality predicate of $\tau$. We denote the set of all terms as
T, variables as V, and literals as L. Throughout this paper, we denote terms by
s,t,u, variables by x,y,z, function symbols by f,g,h, all possibly with indices.
Given a term t such that t is f(...), we write sym(t) for f, indicating that f is the
top level symbol of t. We write $t : \tau$ to denote that t is a term of sort $\tau$. A term,
or literal is called ground, when it does not contain any variables. We refer to the 
sets of all ground terms, and literals as T°, and L? respectively. 

We denote predicates by P, Q, literals by L, clauses by C, D, formulas by F, G,
and sets of formulas (axioms) by $\mathcal{E}$, possibly with indices. We write $F \models G$ to
denote that whenever F holds in a model, then G does as well. We call a function
(similarly, for predicates) f uninterpreted wrt some set of equations $\mathcal{E}$ if whenever
$\mathcal{E} \models f(s_1, \ldots, s_n) \approx f(t_1, \ldots, t_n)$, then $\mathcal{E} \models s_1 \approx t_1 \land \ldots \land s_n \approx t_n$. A function f is
interpreted wrt $\mathcal{E}$ if it is not uninterpreted.


Rational Sort. We assume the signature contains a countable set of unary functions
$k : \tau_{\mathbb{Q}} \to \tau_{\mathbb{Q}}$ for every $k \in \mathbb{Q}$ and refer to $k$ as numeral multiplications. In
addition, the signature is assumed to also contain a constant $1 : \tau_{\mathbb{Q}}$, a function
$+ : \tau_{\mathbb{Q}} \times \tau_{\mathbb{Q}} \to \tau_{\mathbb{Q}}$, and predicate symbols $>, \geq : P(\tau_{\mathbb{Q}} \times \tau_{\mathbb{Q}})$, as well as an arbitrary
number of other function symbols. For every numeral multiplication $k \in \mathbb{Q} \setminus \{1\}$,
we simply write $k$ to denote the term $k(1)$ obtained by the numeral multiplication
$k$ applied to 1; in these cases, we refer to $k$ as numerals. Throughout this paper,
we use j, k, l to denote numerals, or numeral multiplications, possibly with indices.

We write $-t$ to denote the term $-1(t)$. If j,k are two numeral multiplications,
by (jk) and (j + k) we denote the numeral multiplication that corresponds to
the result of multiplying and adding the rationals/numerals j and k, respectively.
For applications of numeral multiplications j(t) we may omit the parentheses
and write jt instead. If we write $+k$ or $-k$ for some numeral k, we assume k
itself is positive. We write $\pm$ (and $\mp$) to denote either of the symbols $+$ or $-$
(and respectively $-$ or $+$). For $q \in \mathbb{Q}$ we define sign(q) to be 1 if q > 0, $-1$ if
q < 0, and 0 otherwise. We call $+$, $>$, $\geq$, 1, and the numeral multiplications the
$\mathbb{Q}$ symbols. Finally, an atomic term is either a logical variable, or the term 1, or a
term whose top-level function symbol is not a $\mathbb{Q}$ symbol.

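A $\mathbb{Q}$-model interprets the sort $\tau_{\mathbb{Q}}$ as $\mathbb{Q}$, and all $\mathbb{Q}$ symbols as their corresponding
functions/predicates on $\mathbb{Q}$. We write $\mathbb{Q} \models C$ iff for every $\mathbb{Q}$-model M, $M \models C$
holds. If $\mathcal{E}$ is a set of formulas, we call a model M an $\mathcal{E}$-model if $M \models \mathcal{E}$.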


Term Orderings. We write $u[s]$ to denote that s is a subterm of u, where the
subterm relation is denoted via $\trianglelefteq$. That is, $s \trianglelefteq u$; similar notation will also be
used for literals $L[s]$ and clauses $C[s]$. We denote by $u[s \mapsto t]$ the term resulting
from replacing all subterms s of u by t.

Multisets (of terms, literals) are denoted with {...}. For a multiset S and
natural number $n \in \mathbb{N}$, we define $0 * S = \emptyset$, and $n * S = ((n-1) * S) \cup S$ for $n > 0$.

Let $\prec$ be a relation and $\equiv$ be an equivalence relation. By $\prec^{mul}$ we denote
the multiset extension of $\prec$, defined as the smallest relation satisfying
$M \cup \{s_1, \ldots, s_n\} \prec^{mul} N \cup \{t\}$, where $M \equiv N$, $n \geq 0$, and $s_i \prec t$ for $1 \leq i \leq n$.
For $n, m \in \mathbb{N}$, by $\prec^{wmul}$ we denote the weighted multiset extension, defined by
$(n, S) \prec^{wmul} (m, T)$ iff $m * S \prec^{mul} n * T$. We omit the equivalence relation $\equiv$ if it
is clear in the context.
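
To make the two extensions concrete, the following is a minimal Python sketch. It assumes that the base relation is a strict partial order, that the equivalence relation is plain equality, and it uses the usual difference-based characterisation of the multiset extension rather than the "smallest relation" formulation above. The function names are illustrative and not taken from any implementation.

```python
from collections import Counter


def multiset_less(S, T, less):
    """Multiset extension of the strict order `less` (equivalence is ==).

    S is below T iff the multisets differ and every element that S has
    in excess of T is dominated by some element that T has in excess of S.
    """
    S, T = Counter(S), Counter(T)
    only_s = S - T  # occurrences of S not matched in T
    only_t = T - S  # occurrences of T not matched in S
    if not only_s and not only_t:
        return False  # equal multisets are not strictly ordered
    return all(any(less(s, t) for t in only_t) for s in only_s)


def scale(n, S):
    """n * S: the multiset union of n copies of S (empty for n = 0)."""
    out = Counter()
    for _ in range(n):
        out += Counter(S)
    return out


def weighted_multiset_less(nS, mT, less):
    """(n, S) is below (m, T) iff m * S is below n * T."""
    (n, S), (m, T) = nS, mT
    return multiset_less(scale(m, S), scale(n, T), less)
```

Read this way, a pair (n, S) behaves like the multiset S "divided by" n: for instance (2, {5}) is below (1, {5}), because {5} is below {5, 5}.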

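Let $s, t, t_i$ be terms, $\theta, \theta'$ be ground substitutions and $\mathcal{E}$ be a set of axioms. We
write $s \equiv_{\mathcal{E}} t$ for $\mathcal{E} \models s \approx t$, and $\theta \equiv_{\mathcal{E}} \theta'$ iff for all variables x we have $x\theta \equiv_{\mathcal{E}} x\theta'$.
We say that s is an $\mathcal{E}$-subterm of t ($s \trianglelefteq_{\mathcal{E}} t$) if $s \equiv_{\mathcal{E}} t$, or $t \equiv_{\mathcal{E}} f(t_1 \ldots t_n)$ and
$s \trianglelefteq_{\mathcal{E}} t_i$. We also say that s is a strict $\mathcal{E}$-subterm of t ($s \triangleleft_{\mathcal{E}} t$) if $s \trianglelefteq_{\mathcal{E}} t$ and $s \not\equiv_{\mathcal{E}} t$.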


4 Theoretical Foundation for Unification with Abstraction 


Our motivating example from Sect. 2 showcases that first-order arithmetic reasoning
requires (i) establishing syntactic difference among terms (e.g. $2x$ and $x + 1$),
while (ii) deriving that they have instances that are semantically equal in models of a
background theory $\mathcal{E}$ (e.g. the theory $\mathbb{Q}$).

A naive approach addressing (i)-(ii) would be to use an axiomatisation of the
background theory $\mathcal{E}$, and use this axiomatisation for proof search in uninterpreted
first-order logic. Such an approach can however be very costly. For example,
even a relatively simple background theory AC axiomatizing commutativity and
associativity of $+$, that is $AC = \{x+y \approx y+x,\ x+(y+z) \approx (x+y)+z\}$, would make
a superposition-based theorem prover derive a vast amount of useless/redundant
formulas as equational tautologies. An approach to circumvent such inefficient
handling of equality reasoning is to use unification modulo AC, or in general
unification modulo $\mathcal{E}$, as already advocated in [22, 34, 40]. In this section we
describe the adjustments we made towards unification modulo $\mathcal{E}$, allowing us
to introduce unification with abstraction (Sect. 4.1). We also show under which
condition our method can be used to turn a complete superposition calculus using
unification modulo $\mathcal{E}$ into a complete superposition calculus using unification with
abstraction. Concretely, we show how this can be used for the specific theory of
arithmetic $\mathcal{A}_{eq}$ in the calculus ALASCA (Sect. 4.2).


    fn uwa(s, t)
      eqs ← {s ≈ t}; σ ← ∅; C ← ∅;
      while eqs ≠ ∅
        s ≈ t ← eqs.pop();
        if s ≈ t ∈ {x ≈ u, u ≈ x} for some x ∈ V, x ⋪ u
          (σ, eqs, C) ← (σ ∪ {x ↦ u}, eqs, C){x ↦ u};
        else if canAbstract(s, t)
          C.push(s ≉ t);
        else if s = f(s1 ... sn), t = f(t1 ... tn)
          eqs.push({s1 ≈ t1 ... sn ≈ tn})
        else
          return ⊥;
      return (σ, C);


Algorithm 1: Computing an abstracting unifier uwa.


4.1 Unification with Abstraction — UWA 


In a nutshell, unification modulo $\mathcal{E}$ finds substitutions $\sigma$ that make two terms s, t
equal in the background theory, i.e. $\mathcal{E} \models s\sigma \approx t\sigma$. While unification modulo $\mathcal{E}$
removes the need for axiomatisation of $\mathcal{E}$ during superposition reasoning, it comes
with some inefficiencies. Most importantly, in contrast to syntactic unification,
there is no unique most general unifier mgu(s,t) when unifying modulo $\mathcal{E}$ but
only minimal complete sets of unifiers $mcu_{\mathcal{E}}(s,t)$, which can be very large; for
example, unification modulo AC is doubly exponential in general [22].
Bypassing the need for unification modulo $\mathcal{E}$, fully abstracted clauses are used
in [40], without the need for axiomatisation of the theory $\mathcal{E}$ and without
compromising completeness of the underlying superposition-based calculus. Our work
extends ideas from [40] and adjusts unification with abstraction (uwa) from [34],
allowing us to prove completeness of a calculus using uwa (Theorem 3).


Example 1. Let us first consider the example of factoring the clause $p(2x) \lor p(x+1)$,
a simplified version of the unification step performed in line 5 in Fig. 1. That
is, unifying the literals $p(2x)$ and $p(x + 1)$, in order to remove duplicate literals.
Within the setting of [40], these literals would only exist in their fully abstracted
form, which can be obtained by replacing every subterm $t : \tau_{\mathbb{Q}}$ that is not a variable
by a fresh variable x, and adding the constraint $x \not\approx t$ to the corresponding clause.
Hence, the clause $p(2x) \lor p(x+1)$ is transformed to $p(y) \lor p(z) \lor y \not\approx 2x \lor z \not\approx x+1$
in [40]. Unification then becomes trivial: we would derive the clause
$p(y) \lor y \not\approx 2x \lor y \not\approx x+1$ by factoring, from which $p(2x) \lor 2x \not\approx x + 1$ is inferred using
equality factoring and resolution.

Within unification with abstraction, we aim at cutting out intermediate steps
of applying abstractions, equality resolution and factoring. As a result, we skip
unnecessary consequences of intermediate clauses, and derive the conclusion
$p(2x) \lor 2x \not\approx x + 1$ straight away. To this end, we introduce constraints only
for those $s, t : \tau_{\mathbb{Q}}$ on which unification fails. We thus gain the advantage that
clauses are not present in the search space in their abstracted forms, increasing
efficiency in proof search. Further, our unification with abstraction approach is
parametrized by a predicate canAbstract to control the application of abstraction,
as listed in Algorithm 1. This is yet another significant difference compared to fully
abstracted clauses, as in the latter, abstraction is performed for every subterm
$t : \tau_{\mathbb{Q}}$ without considering the terms with which t might be unified later.


Our uwa method can be seen as a lazy approach of full abstraction from [40]. We
compute so-called abstracting unifiers uwa$(s,t) = (\sigma, C)$ in Algorithm 1, allowing
us to replace unification modulo $\mathcal{E}$ by unification with abstraction.
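
The control flow of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the Vampire implementation: variables are encoded as strings, other terms as tuples `(symbol, arg1, ..., argn)`, `None` stands for the failure value, and each constraint pair `(s, t)` stands for the literal "s is not equal to t". The predicate `can_abstract` and the symbol set `INTERPRETED` are hypothetical stand-ins for canAbstract.

```python
def is_var(t):
    return isinstance(t, str)


def subst(t, sigma):
    """Apply the substitution sigma (dict: variable -> term) to the term t."""
    if is_var(t):
        return sigma.get(t, t)
    return (t[0],) + tuple(subst(a, sigma) for a in t[1:])


def occurs(x, t):
    """True if the variable x occurs in t (including the case t == x)."""
    if is_var(t):
        return x == t
    return any(occurs(x, a) for a in t[1:])


def uwa(s, t, can_abstract):
    """Return (sigma, constraints), or None if unification fails."""
    eqs, sigma, constraints = [(s, t)], {}, []
    while eqs:
        s, t = eqs.pop()
        if is_var(t) and not is_var(s):
            s, t = t, s  # orient the equation so that a variable is on the left
        if is_var(s) and (s == t or not occurs(s, t)):
            if s != t:
                bind = {s: t}
                # apply the new binding to everything computed so far
                sigma = {v: subst(u, bind) for v, u in sigma.items()}
                sigma[s] = t
                eqs = [(subst(a, bind), subst(b, bind)) for a, b in eqs]
                constraints = [(subst(a, bind), subst(b, bind))
                               for a, b in constraints]
        elif can_abstract(s, t):
            constraints.append((s, t))  # postpone: s and t may be equal modulo the theory
        elif (not is_var(s) and not is_var(t)
              and s[0] == t[0] and len(s) == len(t)):
            eqs.extend(zip(s[1:], t[1:]))  # decompose f(s1..sn) = f(t1..tn)
        else:
            return None
    return sigma, constraints


# Hypothetical abstraction predicate: abstract whenever either top-level
# symbol is interpreted; "2" encodes the numeral multiplication by 2.
INTERPRETED = {"+", "2"}


def can_abstract(s, t):
    def top(u):
        return None if is_var(u) else u[0]
    return top(s) in INTERPRETED or top(t) in INTERPRETED
```

With this encoding, unifying p(2x) and p(x + 1) yields the empty substitution together with the single constraint relating 2x and x + 1, matching the conclusion derived in Example 1.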


Definition 1 (Abstracting Unifier). Let $\sigma$ be a substitution and $C$ a set of
literals. A partial function uwa that maps two terms s, t either to $\bot$ or to a pair
$(\sigma, C) =$ uwa$(s,t)$ is called an abstracting unifier.


The abstracting unifier uwa(s,t) computed by Algorithm 1 is parametrized
by the relation canAbstract. The intuition of this relation is that canAbstract(s, t)
holds for terms s and t, when $s \approx t$ might hold in the background theory $\mathcal{E}$. To
ensure that unification with abstraction can replace unification modulo $\mathcal{E}$, we
impose the following additional properties over the abstracting unifier uwa(s, t).


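Definition 2 (uwa Properties). Let $\sigma$ be a substitution and $C$ a set of literals.
Consider $s, t \in \mathbf{T}$ such that uwa$(s,t) = (\sigma, C)$ and let $\theta$ be an arbitrary ground
substitution. We say uwa is


- $\mathcal{E}$-sound iff $\mathcal{E} \models (s \approx t)\sigma \lor C$;

- $\mathcal{E}$-general iff $\forall \mu \in mcu_{\mathcal{E}}(s,t).\ \exists \rho.\ \sigma\rho \equiv_{\mathcal{E}} \mu$;

- $\mathcal{E}$-minimal iff $\mathcal{E} \models (s \approx t)\sigma\theta \implies \mathcal{E} \models (\neg C)\theta$;

- subterm-founded with respect to the clause ordering $\prec$ iff for every
uninterpreted function or predicate f and every literal $L[\circ]$, it holds that
$\mathcal{E} \models (s \approx t)\theta \implies C\theta \prec L[f(s)]\theta$ or $C\theta \prec L[f(t)]\theta$.


Further, uwa is $\mathcal{E}$-complete if, for all $s, t \in \mathbf{T}$ with uwa$(s,t) = \bot$, we have
$mcu_{\mathcal{E}}(s,t) = \emptyset$.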


Definition 2 is necessary to lift inferences using unification with abstraction.
We thereby want to assure that, whenever $C$ does not hold, then s and t are
equal; hence abstracting unifiers uwa$(x, y) = (\emptyset, x + y \not\approx y + x)$ would be unsound.
The $\mathcal{E}$-generality property enforces that substitutions introduced by uwa are
general enough in order to still be turned into a complete set of unifiers. As such,
$\mathcal{E}$-generality is needed to rule out cases like uwa$(x + y, 2) = (\{x \mapsto 0, y \mapsto 2\}, \emptyset)$,
which would not be able to capture, for example, the substitution $\{x \mapsto 1, y \mapsto 1\}$.
We note that we use uwa to extend counterexample-reducing inference systems
(see Definition 4), allowing inductive completeness proofs. As these inference
systems need to derive conclusions that are smaller than the premises, we need
the subterm-foundedness property to make sure to only introduce constraints that
are smaller than the premises as well. If we have a look at the previous properties,
we see that all of them are fulfilled if uwa$(s,t) = \bot$. Therefore we need to make
sure that uwa only returns $\bot$ when s and t are not unifiable modulo $\mathcal{E}$; this is
captured by $\mathcal{E}$-completeness.

In addition to properties of abstracting unifiers uwa(s, t), we also impose conditions
over the canAbstract relation that parametrizes uwa(s,t). As Algorithm 1 only
introduces equality constraints for subterm pairs that should be unified, a resulting 
abstracting unifier uwa(s, t) is sound. Further, under the assumption that the clause 
ordering is defined as in standard superposition (e.g. using multiset extensions 
of a simplification ordering that fulfills the subterm property), the abstracting 
unifier uwa(s,t) is also subterm-founded. However, to ensure that uwa(s,t) is 
also minimal, interpreted functions should not be treated as uninterpreted ones; 
hence the canAbstract relation needs to always trigger abstraction on interpreted 
functions. Finally, we require that canAbstract does not skip terms which are
potentially equal modulo $\mathcal{E}$, in order to guarantee completeness. Hence, we define
the following properties for canAbstract.


Definition 3 (canAbstract Properties). Let $s, t \in \mathbf{T}$. The canAbstract relation


- captures $\mathcal{E}$, iff for all s, t, it holds that $\exists \rho.\ \mathcal{E} \models (s \approx t)\rho \implies$ canAbstract(s, t);
- guards interpreted functions, iff for all s, t, where sym(s) = sym(t) is an
interpreted function, canAbstract(s,t) holds.


Based on the above, we derive the following result. 


Theorem 1. The abstracting unifier uwa computed by Algorithm 1 is
subterm-founded and sound. If canAbstract guards interpreted functions, then uwa is
$\mathcal{E}$-general and $\mathcal{E}$-minimal. If canAbstract guards interpreted functions and captures
$\mathcal{E}$, then uwa is $\mathcal{E}$-complete.


4.2 UWA Completeness 


We now show how unification with abstraction (uwa) can be used to replace
unification modulo $\mathcal{E}$ in saturation-based theorem proving [3]. We recall from [3]
that in order to show refutational completeness of an inference system $\Gamma$, one
constructs a model functor $I$ that maps sets of ground clauses $N$ to candidate
models $I_N$. In order to show that $\Gamma$ is refutationally complete, one needs to show
that if $N$ is saturated with respect to $\Gamma$, then $I_N \models N$. For this, the notion of a
counterexample-reducing inference system is introduced.
Definition 4. We say an inference system $\Gamma$ is counterexample-reducing, with
respect to a model functor $I$ and a well-founded ordering on ground clauses $\prec$, if
for every ground set of clauses $N$ and every minimal $C \in N$ such that $I_N \not\models C$,
there is an inference

\[ \frac{C_1 \quad \ldots \quad C_n \quad C}{D} \]

where $\forall i.\ I_N \models C_i$, $\forall i.\ C_i \prec C$, $D \prec C$, and $I_N \not\models D$.


We then have the following key result. 


Theorem 2 (Bachmair & Ganzinger [3]). Let $\prec$ be a well-founded ordering
on ground clauses and $I$ be a model functor. Then, every inference system that is
counterexample-reducing wrt $\prec$ and $I$ is refutationally complete.


This result also holds for an inference system being refutationally complete wrt
$\mathcal{E}$ if for every $N$ it holds that $I_N \models \mathcal{E}$. When constructing a refutationally complete
calculus, one usually first defines a ground counterexample-reducing inference 
system and then lifts this calculus to a non-ground inference system. Lifting is 
done such that, if the ground inference system is counterexample reducing, then 
its lifted non-ground version is also counterexample reducing. 

We next show how to transform a lifting of a counterexample-reducing inference
system that uses unification modulo $\mathcal{E}$ into a lifting using unification with
abstraction. That is, given a counterexample-reducing inference system using
unification modulo $\mathcal{E}$ to define its rules, we construct another
counterexample-reducing inference system that uses uwa instead. As we only transform rules that
use unification, we introduce the notion of a unifying rule.


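Definition 5. An inference rule $\gamma$ is a unifying rule if it is of the form

\[ \frac{C_1 \quad \ldots \quad C_n}{D\sigma} \quad \text{where } \sigma \in mcu_{\mathcal{E}}(s,t). \]

We also define the mapping $\cdot_{uwa}$ that maps unifying inferences $\gamma$ to $\gamma_{uwa}$ as

\[ \gamma_{uwa} = \left( \frac{C_1 \quad \ldots \quad C_n}{D\sigma \lor C} \quad \text{where } (\sigma, C) = uwa(s,t) \right) \]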

Soundness of the unifying rule $\gamma$ alone however does not suffice to show
soundness of $\gamma_{uwa}$. Therefore we introduce a stronger notion of soundness that
holds for all the rules we will consider to lift.


Definition 6. Let $\gamma$ be a unifying rule. We say $\gamma$ is strongly sound iff
$\mathcal{E}, C_1, \ldots, C_n \models s \approx t \to D$.


Lemma 1. Assume that $\gamma$ is strongly sound and uwa is sound. Then, $\gamma_{uwa}$ is
sound.

We note that not every inference can be transformed using $\cdot_{uwa}$ without
compromising completeness. To circumvent this problem, we consider the notion
of compatibility with respect to transformations.


Definition 7. Let $\gamma$ be a unifying inference. Then, $\gamma$ unifies strict subterms iff
for every grounding $\theta$ and $u \in \{s, t\}$ there is an uninterpreted function or predicate
f, a literal $L[f(u)]$, and a clause $C' \in \{C_1, \ldots, C_n\}$, such that $L[f(u)]\theta \preceq C'\theta$.


Note that in the above definition we usually have that L[f(s)] or L[f(t)] is 
some literal of one of the premises. 


Definition 8 (uwa-Compatibility). We say an inference y is uwa compatible 
if it is a unifying inference, strongly sound, and unifies strict subterms. 


Theorem 3. Let uwa be a general, compatible, subterm-founded, complete, and
minimal abstracting unifier. If $\Gamma$ is the lifting of a counterexample-reducing
inference system $\Gamma^{\theta}$ with respect to a model functor $I$ and clause ordering $\prec$, then
$\Gamma_{uwa} = \{\gamma_{uwa} \mid \gamma \in \Gamma,\ \gamma \text{ is uwa-compatible}\} \cup \{\gamma \in \Gamma \mid \gamma \text{ is not uwa-compatible}\}$
is the lifting of an inference system $\Gamma^{\theta}_{uwa}$ that is counterexample-reducing with
respect to $I$ and $\prec$.


Theorem 1 and Theorem 3 together imply that, given a compatible inference 
system, we need to only specify the right canAbstract predicate in order to perform 
a lifting using uwa. In Sect. 5 we introduce the calculus ALASCA, a concrete 
inference system with the desired properties, for which a suitable predicate 
canAbstract can easily be found. 


5 ALASCA Reasoning 


We use the lifting results of Sect. 4 to introduce our ALASCA calculus for reasoning
in quantified linear arithmetic, by combining superposition reasoning with
Fourier-Motzkin type inference rules. While an instance of such a combination has
been studied in the LASCA calculus of [26], LASCA is restricted to ground, i.e.
quantifier-free, clauses. Our ALASCA extends LASCA with uwa and provides an
altered ground version ALASCA$^\theta$ (Sect. 5.1) which can be lifted efficiently to the
quantified domain (Sect. 5.2). As quantified reasoning with linear real arithmetic
and uninterpreted functions is inherently incomplete, we provide formal guarantees
about what ALASCA can prove. Instead of focusing on completeness with respect
to $\mathbb{Q}$-models as in [26], we show that ALASCA is complete with respect to a partial
axiomatisation $\mathcal{A}_{\mathbb{Q}}$ of $\mathbb{Q}$-models (Sect. 5.2).


5.1 The ALASCA Calculus — Ground Version 


The ALASCA calculus uses a partial axiomatisation $\mathcal{A}_{\mathbb{Q}}$ of $\mathbb{Q}$-models, and handles
some $\mathbb{Q}$-axioms via inferences and some via uwa. We therefore split the axiom set
$\mathcal{A}_{\mathbb{Q}}$ into $\mathcal{A}_{eq}$ and $\mathcal{A}_{ineq}$, as listed in Fig. 2.

Our ALASCA calculus modifies the LASCA framework [26] to enable an efficient
lifting for quantified reasoning. For simplicity, we first present the ground version
of ALASCA, which we refer to as ALASCA$^\theta$, whose one key benefit is illustrated next.
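\[
\begin{aligned}
\mathcal{A}_{\mathbb{Q}} &= \mathcal{A}_{eq} \cup \mathcal{A}_{ineq} \\
\mathcal{A}_{eq} &= AC \cup \{jx + kx \approx (j+k)x \mid j,k \in \mathbb{Q}\} \cup \{j(k(x)) \approx (jk)x \mid j,k \in \mathbb{Q}\} \\
&\quad \cup \{1(x) \approx x\} \cup \{k(x+y) \approx kx + ky \mid k \in \mathbb{Q}\} \cup \{x + 0 \approx x,\ 0x \approx 0\} \\
\mathcal{A}_{ineq} &= \{x > y \land y > z \to x > z\} \cup \{x > y \to x + z > y + z\} \\
&\quad \cup \{x > y \lor x \approx y \lor y > x\} \cup \{\neg(x > x)\} \cup \{x \geq y \leftrightarrow (x > y \lor x \approx y)\} \\
&\quad \cup \{x > y \to +kx > +ky \mid +k \in \mathbb{Q}\} \cup \{x > y \to -ky > -kx \mid -k \in \mathbb{Q}\}
\end{aligned}
\]

Fig. 2. Axioms handled by the ALASCA calculus. All are implicitly universally quantified.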


Example 2. One central rule of ALASCA is the Fourier-Motzkin variable elimination
rule (FM). We use (FM) in line 7 of Fig. 1, when proving the motivating example
of Sect. 2, given in formula (1). Namely, using (FM), we derive $-2x - y + sk > 0$
from $f(2x,y) - 2x - y > 0$ and $-f(2,y) + sk \geq 0$, under the assumption that
$2x \approx 2$. The (FM) rule can be seen as a version of the inequality chaining rules
of [3], chaining the inequalities $sk \geq f(2,y)$ and $f(2x,y) > 2x + y$. Moreover,
the (FM) rule can also be considered a version of binary resolution, as it resolves
the positive summand $f(2x,y)$ with the negative summand $-f(2,y)$, thus mimicking
resolution over subterms, instead of literals. The main benefit of (FM) comes
with its restricted application to maximal atomic terms in a sum (instead of its
naive application whenever possible).
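
The arithmetic behind this step can be sketched in Python as below. This shows only the plain Fourier-Motzkin combination of two inequalities, with linear terms encoded as dictionaries from atoms to coefficients; the maximality and ordering side conditions of (FM) are deliberately left out, and the function name is illustrative. In the usage example, the atom key `"f"` stands for f(2x, y), identified with f(2, y) under the constraint relating 2x and 2, and the second premise is taken as non-strict.

```python
from fractions import Fraction


def fm_resolve(p1, p2, atom):
    """Combine  j*atom + t1 (> or >=) 0  with  -k*atom + t2 (> or >=) 0.

    Each premise is a pair (coeffs, strict), where coeffs maps atoms to
    rational coefficients. Returns the pair for  k*t1 + j*t2 (> or >=) 0,
    which is strict if at least one premise is strict.
    """
    (c1, strict1), (c2, strict2) = p1, p2
    j, k = Fraction(c1[atom]), -Fraction(c2[atom])
    if j <= 0 or k <= 0:
        raise ValueError("atom must be positive in p1 and negative in p2")
    out = {}
    for a, c in c1.items():
        if a != atom:
            out[a] = out.get(a, 0) + k * c
    for a, c in c2.items():
        if a != atom:
            out[a] = out.get(a, 0) + j * c
    out = {a: c for a, c in out.items() if c != 0}  # drop cancelled atoms
    return out, (strict1 or strict2)
```

Applied to the two premises of Example 2, the sketch returns the coefficients of $-2x - y + sk$ with a strict comparison, i.e. the first literal of line 7 in Fig. 1.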


ALASCA$^\theta$ Normalization and Orderings. Compared to LASCA [26], the major
difference of ALASCA$^\theta$ comes with focusing on which terms are being considered
equal within inferences; this in turn requires careful adjustments in the underlying
orderings and normalization steps of ALASCA$^\theta$, and later also in unification within
ALASCA. In LASCA terms are rewritten in their so-called $\mathbb{Q}$-normalized form, while
equality inference rules exploit equivalence modulo AC. Lifting such inference
rules is however tricky. Consider for example the application of the rewrite rule
$j(ks) \to (jk)s$ (triggered by $j(ks) \approx (jk)s$) over the clause $C[jx, x]$. In order to
lift all instances of this rewrite rule, we would need to derive $C[(jk)x, kx]$ for
every $k \in \mathbb{Q}$, which would yield an infinite number of conclusions. In order to
resolve this matter, ALASCA$^\theta$ takes a different approach to term normalization
and handling equivalence. That is, unlike LASCA, we formulate all inference rules
using equivalence modulo $\mathcal{A}_{eq}$, and do not consider the normalization of terms as
simplification rules.

As ALASCA$^\theta$ rules use equivalence modulo $\mathcal{A}_{eq}$, we also need to impose that
the simplification ordering used by ALASCA$^\theta$ is $\mathcal{A}_{eq}$-compatible. Intuitively,
$\mathcal{A}_{eq}$-compatibility means that terms that are equivalent modulo $\mathcal{A}_{eq}$ are in one
equivalence class wrt the ordering. This allows us to replace terms by an arbitrary
normal form wrt these equivalence classes before and after applying any inference
rules, allowing us to use a normalization similar to $\mathbb{Q}$-normalization that does
not need to be lifted. Hence, we introduce $\mathcal{A}_{eq}$-normalized terms as being terms
whose sort is not $\tau_{\mathbb{Q}}$ or of the form $\frac{1}{k}(k_1 t_1 + \cdots + k_n t_n)$, such that $\forall i.\ k_i \in \mathbb{Z} \setminus \{0\}$,
$\forall i \neq j.\ t_i \not\equiv t_j$, $\forall i.\ t_i$ is atomic, $k$ is positive, and $\gcd(\{k, k_1 \ldots k_n\}) = 1$. Obviously
every term can be turned into an $\mathcal{A}_{eq}$-normalized term. For the rest of this section
we assume terms are $\mathcal{A}_{eq}$-normalized, and write $\equiv$ for $\equiv_{\mathcal{A}_{eq}}$. We also assume that
literals with interpreted predicates $\diamond$ are being normalized (during preprocessing)
to be of the form $t \diamond 0$. We write $s \approx t$ for equalities with sorts different
from $\tau_{\mathbb{Q}}$, and for equalities of sort $\tau_{\mathbb{Q}}$ that can be rewritten to $s \approx t$ such that s is
an atomic term. Finally, ALASCA$^\theta$ also extends LASCA by not only handling the
predicates $>$ and $\approx$, but also $\geq$ and $\not\approx$, which has the advantage that inequalities
are not being introduced in purely equational problems in ALASCA$^\theta$.

As discussed in Example 2, the (FM) rule of ALASCA$^\theta$ is similar to binary
resolution, as it can be seen as "resolving" atomic subterms instead of literals. To
formalize such handling of terms in (FM), we distinguish so-called atoms(t), atoms
of some term t. Doing so, given an $\mathcal{A}_{eq}$-normalized term $t = \frac{1}{k}(\pm_1 k_1 t_1 + \cdots \pm_n k_n t_n)$,
we define $atoms^{\pm}(t) = (k, k_1 * \{\pm_1 t_1\} \cup \ldots \cup k_n * \{\pm_n t_n\})$ and
$atoms(t) = (k, k_1 * \{t_1\} \cup \ldots \cup k_n * \{t_n\})$. We extend both of these functions $f \in \{atoms, atoms^{\pm}\}$
to literals as follows: $f(t \diamond 0) = f(t)$, assuming that the term t has been normalised
such that $k = 1$ before. For (dis)equalities $s \approx t$ ($s \not\approx t$) of uninterpreted sorts, we define
atoms to be $(1, \{s, t\})$. Further we define maxAtoms(t) to be the set of maximal
terms in atoms(t) with respect to $\prec$, and maxAtom(t) $= t_0$ if maxAtoms(t) $= \{t_0\}$.
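
A minimal Python sketch of this normal form and of the two atom functions, assuming a linear term is given as a dictionary from (already atomic, pairwise distinct) atoms to rational coefficients. The function names and the encoding are illustrative only; signed atoms are encoded as pairs `(sign, atom)`.

```python
from collections import Counter
from fractions import Fraction
from functools import reduce
from math import gcd


def normalise(coeffs):
    """Return (k, ints) such that the term equals (1/k) * sum(ints[a] * a),
    with non-zero integer coefficients, k > 0 and gcd(k, all coefficients) = 1."""
    coeffs = {a: Fraction(c) for a, c in coeffs.items() if c != 0}
    # k starts as the least common multiple of all denominators
    k = reduce(lambda x, y: x * y // gcd(x, y),
               (c.denominator for c in coeffs.values()), 1)
    ints = {a: int(c * k) for a, c in coeffs.items()}
    g = reduce(gcd, ints.values(), k)  # common factor of k and all coefficients
    return k // g, {a: n // g for a, n in ints.items()}


def atoms(coeffs):
    """(k, multiset containing each atom t_i exactly |k_i| times)."""
    k, ints = normalise(coeffs)
    return k, Counter({a: abs(n) for a, n in ints.items()})


def atoms_signed(coeffs):
    """Like atoms, but every atom is tagged with the sign of its coefficient."""
    k, ints = normalise(coeffs)
    return k, Counter({(1 if n > 0 else -1, a): abs(n) for a, n in ints.items()})
```

For instance, the term $\frac{1}{2}a - \frac{3}{4}b$ is normalised to $\frac{1}{4}(2a - 3b)$, so its atoms are the pair of 4 and the multiset containing a twice and b three times.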


ALASCA$^\theta$ Inferences. The inference rules of ALASCA$^\theta$ are summarized in Fig. 3a.
All rules are parametrized by an $\mathcal{A}_{eq}$-compatible ordering relation $\prec$ on ground
terms, literals and clauses. Underlining a literal in a clause or an atomic term in
a sum means that the underlined expression is non-strictly maximal wrt the
other literals in the clause, or atomic terms in the sum. We use double-underlining
to denote that the expression is strictly maximal. We call $\mathbf{L}^{+}$ the set of potentially
productive literals, defined as all equalities and inequalities with a strictly maximal
atomic term with positive coefficient.


Finding a right ordering relation is non-trivial, as many different requirements, 
like compatibility, subterm property, well-foundedness, and stability under substi- 
tutions, need to be met [25, 26, 39, 41]. For ALASCA, we use a modified version 
of the QKBO ordering of [26], with the following two modifications. 


(i) Firstly, the ALASCA ordering is defined for non-ground terms. This means
that the ordering needs to handle subterms with sums where there is no maximal
atomic summand, like the term $x + y$. In addition, our ordering needs to be
stable under substitutions in order to work with non-ground terms. Note however
that our atom functions atoms and $atoms^{\pm}$ are not stable under substitutions, as
the term $f(x) - f(y)$ and the substitution $\{x \mapsto y\}$ demonstrate. Therefore, we
parametrize our ALASCA ordering by the relation subsSafe. The subsSafe relation
fulfils the property that if subsSafe$(\frac{1}{k}(\pm_1 k_1 t_1 + \cdots \pm_n k_n t_n))$, then there is no
substitution $\theta$ such that $\pm_i k_i t_i\theta \equiv \mp_j k_j t_j\theta$, for any i, j. In general, checking the
existence of such a $\theta$ is as hard as unifying modulo $\mathcal{A}_{eq}$. Nevertheless, we can
overapproximate the subsSafe relation using the canAbstract predicate.
[Figure 3. Panel (a): Fourier-Motzkin Elimination (FM), Tight Fourier-Motzkin Elimination, Inequality Factoring (IF), Term Factoring (TF), Contradiction (Triv), Superposition, Equality Resolution, Equality Factoring (EF). Panel (b): Variable Elimination (VE).]

(a) Rules of the ground calculus ALASCA$^\theta$.

(b) Variable elimination rule used for lifting ALASCA$^\theta$.


Fig. 3. Inference rules used to define the calculus ALASCA.
(ii) Secondly, we adjusted the ALASCA ordering to be $\mathcal{A}_{eq}$-compatible, instead
of AC-compatible. We modified the literal ordering of ALASCA, such that literals
are ordered by all their atoms using the weighted multiset extension of $\prec$, instead
of only using the maximal one of each literal L as in [26].

We define a model functor $\mathcal{I}$, mapping clauses to $\mathcal{A}_{\mathbb{Q}}$-models (see [23] for
details) and conclude the following.


Theorem 4. ALASCA$^\theta$ is a counterexample-reducing inference system with
respect to $\mathcal{I}$ and $\prec$.


5.2 ALASCA Lifting and Completeness 


Variable Elimination. Theorem 4 establishes completeness of ALASCA$^\theta$ for ground
clauses wrt $\mathcal{A}_{\mathbb{Q}}$. We next lift this result (and calculus) to non-ground clauses.

We introduce the concept of an unshielded variable. We say a term $t : \tau_{\mathbb{Q}}$ is
a top-level term of a literal L if $t \in atoms(L)$. We call a variable x unshielded
in some clause C if x is a top-level term of a literal in C, and there is no literal
with an atomic top-level term $t[x]$. Observe that within the ALASCA$^\theta$ rules, only
maximal atomic terms in sums are being used in rule applications. This means that
lifting ALASCA$^\theta$ to ALASCA is straightforward for clauses where all maximal terms
in sums are not variables. Further, due to the subterm property, if a variable is
maximal in a sum then it must be unshielded. Hence, the only variables we have
to deal with in ALASCA rule applications are unshielded ones.

The work of [40] modifies a standard saturation algorithm by integrating it
with a variable elimination rule that gets rid of unshielded variables, without
compromising completeness of the calculus. Based on [40] and the variable elimination
rule of [3], we extend ALASCA$^\theta$ with the Variable Elimination Rule (VE), as given
in Fig. 3b. In what follows, we show that the handling of unshielded variables in
Fig. 3b can naturally be done within a standard saturation framework.

The (VE) rule replaces any clause with a set of clauses that is equivalent and
does not contain unshielded variables. We assume that the clause is normalized,
such that in every inequality x only occurs once with a factor 1 or $-1$, whereas
for equalities, x only occurs with factor 1. A simple example for the application of
(VE) is the clause $a - x > 0 \lor x - b > 0 \lor a + b - x > 0$, where $x \in \mathbf{V}$, and a, b are
constants. By reasoning about inequalities, it is easy to see that this is equivalent
to $a > x \lor a + b > x \lor x > b$, thus further equivalent to $a > b \lor a + b > b$, which
illustrates the benefit of variable elimination through (VE).
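
The equivalence claimed for this example can be sanity-checked by brute force. The Python sketch below compares the clause before and after elimination for small integer values of a and b. It is a finite check under stated assumptions, not a proof: a, b and x range over integers only, which suffices here because any falsifying x can be replaced by the integer max(a, a + b), and the grid for x is chosen wide enough to contain that value. All names are illustrative.

```python
from itertools import product


def premise(a, b, x):
    """The clause  a - x > 0  or  x - b > 0  or  a + b - x > 0."""
    return a - x > 0 or x - b > 0 or a + b - x > 0


def conclusion(a, b):
    """The clause after eliminating x:  a > b  or  a + b > b."""
    return a > b or a + b > b


def check(bound=3):
    """Compare both clauses for all integers a, b in [-bound, bound]."""
    xs = range(-2 * bound - 1, 2 * bound + 2)  # covers a, b and a + b
    for a, b in product(range(-bound, bound + 1), repeat=2):
        holds_for_all_x = all(premise(a, b, x) for x in xs)
        if holds_for_all_x != conclusion(a, b):
            return False
    return True
```

The premise is read with x universally quantified, so it is compared against the x-free conclusion for each pair (a, b).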


Lemma 2. The conclusion of (VE) is equivalent to its premise. 


ALASCA Calculus - Non-Ground Version with Unification with Abstraction. We
now define our lifted calculus ALASCA, as follows. Let ALASCA$^-$ be the calculus
ALASCA$^\theta$ being lifted for clauses without unshielded variables. We define ALASCA
to be ALASCA$^-$ chained with the variable elimination rule. That is, the result of
every rule application is simplified using (VE) as long as applicable.
Theorem 5. ALASCA$^-$ is the lifting of a counterexample-reducing inference system
for sets of clauses without unshielded variables.


Theorem 5 implies that ALASCA$^-$ is refutationally complete wrt $\mathcal{A}_{\mathbb{Q}}$ for sets of
clauses without unshielded variables. As (VE) can be used to preprocess arbitrary
sets of clauses to eliminate all unshielded variables, we get the following.


Corollary 1. If N is a set of clauses that is unsatisfiable with respect to $\mathcal{A}_{\mathbb{Q}}$,
then N can be refuted using ALASCA.


We conclude this section by specifying the lifting of ALASCA$^\theta$ to get ALASCA$^-$.
To this end, we use our uwa results and properties for unification with abstraction
(Sect. 4). We note that using unification modulo $\mathcal{A}_{eq}$ would require us to develop
an algorithmic approach that computes a complete set of unifiers modulo $\mathcal{A}_{eq}$,
which is a quite challenging task both in theory and in practice. Instead, using
Theorem 1 and Theorem 3, we need to only specify a canAbstract predicate that
guards interpreted functions and captures $\mathcal{A}_{eq}$ within uwa. This is achieved
by defining canAbstract(s,t) to hold if any function symbol $f \in \{sym(s), sym(t)\}$ is an
interpreted function $f \in \mathbb{Q} \cup \{+\}$. This choice of the canAbstract predicate is a
slight modification of the abstraction strategy one_side_interpreted of [34].
We note that this is not the only choice for the predicate to fulfil the canAbstract
properties. Consider for example the terms $f(x) + a$ and $a + b$. There is no
substitution that will make these two terms equal, but our abstraction predicate
introduces a constraint upon trying to unify them. In order to address this, we
introduce an alternative canAbstract predicate that compares the atoms of a term,
instead of only looking at the outermost symbol (Sect. 6).
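
The first predicate can be sketched in a few lines of Python. The encoding is an assumption made for illustration: variables are strings, other terms are tuples `(symbol, arg1, ..., argn)`, and numeral multiplications use a `Fraction` as their symbol. Only the outermost-symbol check described above is shown; the atom-based alternative is not sketched here.

```python
from fractions import Fraction


def sym(t):
    """Top-level symbol of a term, or None for a variable."""
    return None if isinstance(t, str) else t[0]


def is_interpreted(symbol):
    """Interpreted symbols: '+' and the numeral multiplications."""
    return symbol == "+" or isinstance(symbol, Fraction)


def can_abstract(s, t):
    """Abstract as soon as one of the two top-level symbols is interpreted."""
    return is_interpreted(sym(s)) or is_interpreted(sym(t))
```

On the terms f(x) + a and a + b this predicate holds, so a constraint is introduced even though no substitution makes the two terms equal, which is the imprecision discussed above.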

We believe more precise abstraction predicates can improve proof search, as
evidenced by our experiments using the second abstraction predicate (Sect. 6).


6 Implementation and Experiments 


We implemented ALASCA⁵ in an extension of the VAMPIRE theorem prover [27].


Benchmarks. We evaluated the practicality of ALASCA using the following six 
sets of benchmarks, resulting all together in 6374 examples, as listed in Table 1 
and detailed next. (i) We considered all sets of benchmarks from the SMT-LIB 
repository [7] set that involve real arithmetic and uninterpreted functions, but no 
other theories. These are the three benchmark sets corresponding to the LRA, 
NRA, and UFLRA logics in SMT-LIB. (ii) We further used Sledgehammer 
examples generated by [15], using the SMT-LIB syntax. From the examples of [15], 
we selected those benchmarks that involve real arithmetic but no other theories. 
We refer to this benchmark set as SH. (iii) Finally, we also created two new sets of 
benchmarks, TRIANGULAR, and LIMIT, exploiting various mathematical properties. 
The TRIANGULAR suite contains variations of our motivating example from 
Sect. 2, and thus comes with reasoning challenges about triangular inequalities 


5 Available at https://github.com/vprover/vampire/tree/alasca
Benchmarks (#)  | ALASCA | Cvc5 | VAMPIRE | YICES | ULTELIM | SMTINT | VERIT | solved
all (6374)      |   5744 | 5626 |    5585 |  5531 |    5218 |    828 |   465 |   5988
LRA (1722)      |   1572 | 1401 |    1396 |  1722 |    1469 |    623 |    89 |   1722
NRA (3814)      |   3800 | 3804 |    3803 |  3809 |    3669 |      0 |     0 |   3812
UFLRA (10)      |     10 |   10 |      10 |     0 |       0 |     10 |    10 |     10
TRIANGULAR (34) |     24 |   10 |      13 |     0 |       0 |      0 |     6 |     25
LIMIT (280)     |    100 |   90 |      81 |     0 |      80 |      0 |    90 |    100
SH (514)        |    238 |  311 |     282 |     0 |       0 |    195 |   270 |    319


Table 1. Experimental results, showing the numbers of solved problems. 


and continuous functions. The LIMIT benchmark set is comprised of problems 
that combine various limit properties of real-valued functions. 

Experimental Setup. We compared our implementation against the solvers from 
the Arith (arithmetic) division of the SMT-COMP competition 2022. These 
solvers, given in columns 3-8 of Table 1, are: Cvc5 [5], VAMPIRE [35], YICES [19], 
ULTELIM [8], SMTINT [21], and VERIT [2]. We note that VAMPIRE is run in its 
competition portfolio mode, which includes the work from [34]. ALASCA uses the
same portfolio but implements our modified version of unification with abstraction
(Sect. 4) and disables the use of theory axioms, relying instead on our new ALASCA rules
(Sect. 5). We ran our experiments using the SMT-COMP 2022 competition setup:
the StarExec Iowa cluster, with a 20-minute timeout and 4 cores.
Benchmarks, solvers and results are publicly available⁶.

Experimental Results. Table 1 summarizes our experimental findings and indicates 
the overall best performance of ALASCA. For example, ALASCA outperforms the 
two best arithmetic solvers of SMT-COMP 2022 by solving 118 more problems 
than Cvc5 and 159 more problems than VAMPIRE. 


7 Conclusions and Future Work 


We introduced the ALASCA calculus and drastically improved the performance 
of superposition theorem proving on linear arithmetic. ALASCA eliminates the 
use of theory axioms by introducing theory-specific rules such as an analogue 
of Fourier-Motzkin elimination. We perform unification with abstraction with a 
general theoretical foundation, which, together with our variable elimination rules, 
serves as a replacement for unification modulo theory. Our experiments show 
that ALASCA is competitive with state-of-the-art theorem provers, solving more 
problems than any prover that entered the arithmetic division in SMT-COMP 
2022. Future work includes designing an integer version of ALASCA, developing 
different versions for the canAbstract predicate, and improving literal/clause 
selections within ALASCA. 
Abstract. Parity games are two-player zero-sum games of infinite duration
played on finite graphs for which no polynomial-time solution
is yet known. Solving a parity game is an NP ∩ co-NP problem, with the
best worst-case complexity algorithms available in the literature running
in quasi-polynomial time. Given the importance of parity games within 
automated formal verification, several practical solutions have been ex- 
plored showing that considerably large parity games can be solved some- 
what efficiently. Here, we propose a new approach to solving parity games 
guided by the efficient manipulation of a suitable matrix-based represen- 
tation of the games. Our results show that a sequential implementation 
of our approach offers very competitive performance, while a parallel im- 
plementation using GPUs outperforms the current state-of-the-art tech- 
niques. Our study considers both real-world benchmarks of structured 
games as well as parity games randomly generated. We also show that 
our matrix-based approach retains the optimal complexity bounds of the 
best recursive algorithm to solve large parity games in practice. 


Keywords: Parity games · Formal verification · Parallel computing.


1 Introduction 


Parity games are one of the most useful and effective algorithmic tools used in 
automated formal verification [18,5,2]. Indeed, several computational problems, 
such as model checking and automated synthesis using temporal logic specifi- 
cations, can be reduced to the solution of a parity game [5,2]. More formally, 
a parity game is a two-player zero-sum game of infinite duration played on a 
finite graph. Since these games are determined [14,8], solving them is equiva- 
lent to finding a winning strategy for one of the two players in the game; or, 
similarly, deciding from which vertices in the graph one of the two players in 
the game can force a win no matter the strategy that the other player makes 
use of. The main question regarding parity games is that of the computational 
complexity of finding a solution of the game, a problem that is known to be 
in NP ∩ co-NP [11]. However, despite decades of research, a polynomial-time
algorithm to solve such games remains elusive. The best-known decision procedures
to solve parity games, most of them recently developed [4,13], run in
quasi-polynomial time, which provides better worst-case complexity upper bounds than
previous exponential-time approaches [18] found in the parity games literature.


© The Author(s) 2023 
S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 666-683, 2023. 
https://doi.org/10.1007/978-3-031-30823-9_34
The importance of parity games in the solution of real-life automated verification
problems, and the lack of a polynomial-time decision procedure to solve
such games, have motivated the development and implementation of algorithms
that can solve parity games somewhat efficiently in practice, despite their known
worst-case exponential time complexity. In the quest for such decision
procedures, several different approaches have been investigated in the last two
decades, with solutions that try to optimise the choice of
high-level algorithm to reason about parity games, the programming language
used to implement such a solution, the concrete data structures used to represent
the games, or the type of hardware architecture used for deployment [7,6,17,9].

Progress solving parity games in practice has been made in different direc- 
tions. In [7], a state-of-the-art implementation of the best-known algorithms for 
solving parity games was presented. In this work, two algorithms were found to 
deliver the best performance in practice, namely, Zielonka’s recursive algorithm 
(ZRA [18]) and priority promotion [3], with the former showing slightly better 
performance when solving random games and a selection of structured games 
for model checking, and the latter outperforming ZRA when solving a selection 
of structured games for equivalence checking. But, overall, the two algorithms
exhibit extremely similar performance in practice, including that of a parallel
implementation of ZRA. Another attempt to improve the performance of solving
parity games is presented in [6]. In this work, better performance is sought
through a parallel implementation of ZRA, known to consistently exhibit the
best performance on different platforms and for different types of games.

These two works [7,6] contain two strikingly opposing conclusions. While 
in [7] the parallel implementation of ZRA is even outperformed by the best 
sequential implementation of the same algorithm, in [6] significant gains in per- 
formance are observed when parallelising the computation of ZRA — which may 
solve a large set of random parity games between 3.5 and 4 times faster than the 
sequential implementation of the same algorithm. These two results, arguably
both consistent with the state of the art in the solution of parity games in
practice, indicate that no definitive conclusion can be drawn as to what the best
approach to solving parity games in practice is, let alone whether
a parallel implementation would necessarily produce better results than its
sequential version. In this paper, we present a new approach to solving parity
games, and investigate some of the issues exposed by the two above papers.

More specifically, motivated by the need to find effective new techniques for 
solving parity games, in particular in large practical settings, in this paper we: 


1. propose a novel matrix-based approach to solving parity games, based on 
ZRA [18,13], arguably, the best-performing algorithm in practice [7]; 

2. study the complexity of our matrix-based procedure, and show that it retains 
the optimal complexity bounds of the best algorithms for parity games [13]; 

3. develop a parallel implementation, which takes advantage of methods and 
hardware for matrix manipulation using sophisticated GPU technologies; 

4. investigate a number of alternative implementations of our matrix-based 
approach in order to better assess its usefulness in practical settings. 
Our matrix-based approach, whose parallel implementation outperforms the
state-of-the-art solvers for parity games, consists in the reduction of key operations
on parity games to simple computations on large matrices, which can
be significantly accelerated in practice using sophisticated techniques for matrix
manipulation, specifically, using modern GPU technologies. Firstly, our matrix-based
approach partly builds on the observation that most of the computation
time when using ZRA is spent running a particular subroutine called the “attrac- 
tor” function, which we can parallelise. Secondly, we also rely on the observation 
that computations on matrices — which guide the search for the solution of parity 
games within our approach — can be efficiently parallelised using a combination 
of both algorithmic techniques for parallel computation and GPU devices. 


2 Preliminaries 


A parity game is a two-player zero-sum infinite-duration game played over a finite
directed graph $G = (V_0, V_1, E, \Omega)$, where $V = V_0 \cup V_1$ is a set of vertices/nodes
partitioned into vertices $V_0$ controlled by Player Even/0 and vertices $V_1$ controlled
by Player Odd/1. Whenever a statement about both players is made, we
may use the letter $q \in \{0,1\}$ to refer to either player, and $1 - q$ to refer to
the other player in the game. Without any loss of generality, we also assume
that every vertex in the graph has at least one successor. Moreover, the function
$\Omega : V \to \mathbb{N}$ is a labelling function on the set of vertices of the graph which
assigns each vertex a priority. Intuitively, a parity game is played by
moving a token along the graph (starting from some designated node in $V$), with
the owner of the node on which the token currently sits selecting a successor node in the
graph. Because every vertex has a successor, this process continues indefinitely,
producing an infinite sequence of visited nodes, and consequently an infinite
sequence of seen priorities. The winner of a particular play is determined by the
highest priority that occurs infinitely often: Player 0 wins if the highest infinitely
recurring priority is even, while Player 1 wins if the highest infinitely recurring
priority is odd. Parity games are determined, which means that it is always the
case that one of the two players has a strategy (called a winning strategy) that
wins against all possible strategies of the other player. Solving a parity game
amounts to deciding, for every node in the game, which player has a winning
strategy for the game starting in such a node; that is, computing disjoint sets
$W_0 \subseteq V$ and $W_1 \subseteq V$ such that Player $q$ has a winning strategy to win every
play in the game that starts from a node in $W_q$, with $q \in \{0,1\}$.
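
As a small illustration of the winning condition, consider a play that eventually repeats a cycle forever: only the vertices on the cycle are visited infinitely often, so the winner is decided by the highest priority on the cycle. The sketch below assumes such an ultimately periodic play; the function name `winner_of_lasso` and the dictionary encoding of priorities are hypothetical choices for this example.

```python
def winner_of_lasso(priority, prefix, cycle):
    """Winner (0 or 1) of the play that follows prefix once and then repeats cycle forever."""
    if not cycle:
        raise ValueError("the repeated part of the play must be non-empty")
    # Vertices in the prefix are seen finitely often, so they do not matter.
    top = max(priority[v] for v in cycle)
    return top % 2  # even highest recurring priority: Player 0; odd: Player 1


priority = {"a": 3, "b": 2, "c": 1}
print(winner_of_lasso(priority, prefix=["a"], cycle=["b", "c"]))  # 0: highest recurring priority is 2
print(winner_of_lasso(priority, prefix=[], cycle=["a", "b"]))     # 1: highest recurring priority is 3
```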

Somewhat surprisingly, the best performing algorithm to solve parity games
in practice is Zielonka’s Recursive Algorithm (ZRA [18]), which runs in time exponential
in the number of priorities, which is bounded by $|V|$. This algorithm is rather
simple, and mostly relies on the computation of attractor sets, which are sets
of vertices $A = Attr_q(X)$ inductively defined for each Player $q$ as shown below
and used to compute both $W_0$ and $W_1$ recursively. Formally, the attractor
function $Attr_q : \mathcal{P}(V) \to \mathcal{P}(V)$ for Player $q$ computes the attractor set of a
given set of vertices $U \subseteq V$, and is defined inductively as follows:
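Algorithm 1 Zielonka(G)

  if $V = \emptyset$ then
    $(W_0, W_1) \leftarrow (\emptyset, \emptyset)$
  else
    $m \leftarrow \max\{\Omega(v) \mid v \in V\}$
    $q \leftarrow m \bmod 2$
    $U \leftarrow \{v \in V \mid \Omega(v) = m\}$
    $A \leftarrow Attr_q(U)$
    $(W'_0, W'_1) \leftarrow$ Zielonka$(G \setminus A)$
    if $W'_{1-q} = \emptyset$ then
      $(W_q, W_{1-q}) \leftarrow (A \cup W'_q, \emptyset)$
    else
      $B \leftarrow Attr_{1-q}(W'_{1-q})$
      $(W'_0, W'_1) \leftarrow$ Zielonka$(G \setminus B)$
      $(W_q, W_{1-q}) \leftarrow (W'_q, W'_{1-q} \cup B)$
    end if
  end if
  return $(W_0, W_1)$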


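\[
\begin{aligned}
Attr_q^{0}(U) &= U \\
Attr_q^{k+1}(U) &= Attr_q^{k}(U) \\
  &\quad \cup \{u \in V_q \mid \exists v \in Attr_q^{k}(U) : (u,v) \in E\} \\
  &\quad \cup \{u \in V_{1-q} \mid \forall v \in V : (u,v) \in E \Rightarrow v \in Attr_q^{k}(U)\} \\
Attr_q(U) &= Attr_q^{|V|}(U)
\end{aligned}
\]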
As shown in Algorithm 1, ZRA [18] finds disjoint sets of vertices $W_0$ / $W_1$ from
which Player 0/1 has a winning strategy. Through the computation of attractor
sets, the algorithm works by recursively decomposing the graph, finding sets
of nodes that could be forced towards the highest priority node(s), and hence
building the winning regions $W_0$ and $W_1$ for each player in the game.
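
A minimal set-based sketch of Algorithm 1 and of the attractor definition may help to fix ideas. It is an illustration under assumptions, not the implementation evaluated in this paper: the game is encoded with plain dictionaries (`succ`, `owner`, `priority` are hypothetical names), and the attractor is computed by naive iteration to a fixed point.

```python
def attractor(nodes, succ, owner, q, target):
    """Nodes of the subgame `nodes` from which player q can force a visit to `target`."""
    attr = set(target) & nodes
    changed = True
    while changed:
        changed = False
        for u in nodes - attr:
            # For each successor inside the subgame: is it already attracted?
            hits = [v in attr for v in succ[u] if v in nodes]
            # Player q needs one such successor, the opponent must have no escape.
            if hits and (any(hits) if owner[u] == q else all(hits)):
                attr.add(u)
                changed = True
    return attr


def zielonka(nodes, succ, owner, priority):
    """Return (W0, W1), the winning regions of Player 0 and Player 1 in the subgame."""
    if not nodes:
        return set(), set()
    m = max(priority[v] for v in nodes)
    q = m % 2
    U = {v for v in nodes if priority[v] == m}
    A = attractor(nodes, succ, owner, q, U)
    sub = zielonka(nodes - A, succ, owner, priority)
    win = [None, None]
    if not sub[1 - q]:
        win[q], win[1 - q] = A | sub[q], set()
    else:
        B = attractor(nodes, succ, owner, 1 - q, sub[1 - q])
        sub = zielonka(nodes - B, succ, owner, priority)
        win[q], win[1 - q] = sub[q], sub[1 - q] | B
    return win[0], win[1]


# Hypothetical three-node game: Player 1 can move from node 1 to node 2
# and loop there forever on the odd priority 1.
succ = {0: [1], 1: [0, 2], 2: [2]}
owner = {0: 0, 1: 1, 2: 1}
priority = {0: 2, 1: 1, 2: 1}
print(zielonka({0, 1, 2}, succ, owner, priority))
```

In this example Player 1 wins from every node, so the first component of the result is empty and the second contains all three nodes.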


3 A matrix-based approach 


Experimental results from [7] motivated us to investigate whether ZRA can be
improved in practice, since such an algorithm shows the best performance both in 
random games as well as in several structured games found in practical settings. 
This finding is complemented by the observation made in [6], that when running 
ZRA most of the time is spent in the computation of attractor sets, reported 
to be about 99% in [6] (with experiments considering random games only), and 
found to be of about 77% in our study (which considers larger classes of games). 

Our observation, and working hypothesis, not found in previous work [7,6], 
is that the basic ZRA can be highly optimised in practice if its main compu- 
tation component — the attractor set subroutine — is accelerated using efficient 
Algorithm 2 Attr(A, t, q, g, o)
  $d \leftarrow Ag$
  $t' \leftarrow 0$
  while $\|t \neq t'\|_1 \neq 0$ do
    $t' \leftarrow t$
    $v \leftarrow At$
    $t \leftarrow g \odot ((o = q) \odot (v > 0) + (o = (1 - q)) \odot (v = d))$
  end while
  return $t$


techniques for matrix manipulation, provided that the attractor set
procedure is expressed as operations on matrices encoding the
attractor set subroutine in ZRA. This is precisely what we do in this section,
which in turn makes our approach well suited to a parallel implementation
using modern GPU technologies for efficient matrix manipulation.
To achieve a matrix-based encoding of ZRA, and in particular of its attractor
set subroutine, we redefine the representation of the graph in terms of a sparse
adjacency matrix $A$, a vector $o$ defining the ownership of every node, and a vector
$w$ defining the priority of every node. Due to the potentially high computational
cost of copying $A$, we maintain a vector $g$ representing which nodes are still
included in the game (a subgame being computed at that point in the algorithm),
which is copied and updated as Zielonka’s algorithm recurses and decomposes
the graph into ever smaller parts. As such, we are able to find $d = Ag$, a vector
containing the out-degree of every node within the current subgame. More specifically:


— $(A)_{ij} = 1$, if an edge exists from $i$ to $j$; $(A)_{ij} = 0$, otherwise;
— $(o)_i = q$, if node $i$ belongs to player $q$;
— $(w)_i = \Omega(i)$;
— $(g)_i = 1$, if node $i$ is in the game; $(g)_i = 0$, otherwise.


With these definitions in place, we can make the necessary modifications
to the attractor function presented before (see Algorithm 2). The input/output
vector $t$ contains 1 at position $(t)_i$ if node $i$ is part of the attractor set and
0 otherwise. We use vectorised operations: if a vector is compared
to another vector, the comparison is done element-wise; if a vector is
compared to a scalar $s$, the scalar is implicitly converted to the vector $s\mathbf{1}$. The $\odot$
operator denotes the Hadamard product, which is used primarily as a Boolean
And operation. The argument $q$ is the player: 0 for Player 0 and 1 for Player 1.


This algorithm works by first finding the number of outbound edges each
node has ($d \leftarrow Ag$), and at each iteration finding how many ways each node can
enter the attractor set ($v \leftarrow At$). It then finds nodes that $q$ owns that may enter
the attractor set ($(o = q) \odot (v > 0)$), and nodes that $q$ does not own that are forced
to enter the attractor set ($(o = (1 - q)) \odot (v = d)$). It then filters the nodes to
include into the attractor set depending on which nodes are still included in the
subgraph ($g \odot (\cdots)$), and breaks the loop when there is no difference between $t$
and $t'$. To illustrate this procedure, take as an example the graph below.
Algorithm 3 MatZielonka(A, g, o)

  if $\|g\|_1 = 0$ then
    $(W_0, W_1) \leftarrow (0, 0)$
  else
    $m \leftarrow \max(g \odot w)$
    $q \leftarrow m \bmod 2$
    $t \leftarrow (w = m)$
    $t \leftarrow$ Attr$(A, t, q, g, o)$
    $(W'_0, W'_1) \leftarrow$ MatZielonka$(A, g - t, o)$
    if $\|W'_{1-q}\|_1 = 0$ then
      $(W_q, W_{1-q}) \leftarrow (t + W'_q, 0)$
    else
      $t \leftarrow$ Attr$(A, W'_{1-q}, 1 - q, g, o)$
      $(W'_0, W'_1) \leftarrow$ MatZielonka$(A, g - t, o)$
      $(W_q, W_{1-q}) \leftarrow (W'_q, W'_{1-q} + t)$
    end if
  end if
  return $(W_0, W_1)$


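\[
A = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0
\end{pmatrix}
\]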


For this example, assume that $g = \mathbf{1}$ and that we are computing the attractor
set for the player that owns the circle nodes, starting from the node with priority 7.
After one (or some arbitrary number of) iteration(s), the current state is reached.
Green nodes denote nodes included in the previous iteration’s attractor set, and
yellow nodes denote nodes that will be included in this iteration. The calculations
performed are as follows. Define the adjacency matrix of the graph
($A$), the currently included nodes in the attractor set, $t = (1\ 1\ 0\ 0\ 0)^T$, the
ownership of every node, $o = (0\ 0\ 1\ 1\ 0)^T$, and the degree (number of outbound
edges) of every node, $d = Ag = (1\ 1\ 2\ 2\ 1)^T$. Now, compute the number of
edges from each node leading to an element in the current attractor set, that is,
$v = At = (1\ 1\ 2\ 1\ 1)^T$, and with that, update $t$ to obtain $t \leftarrow (1\ 1\ 1\ 0\ 1)^T$,
which exactly represents the value of the attractor function one step later. Similar
changes for ZRA in terms of the representation of the game must also be made,
so that it becomes, fully, a matrix manipulation algorithm (Algorithm 3).
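
The worked example and the structure of Algorithms 2 and 3 can be reproduced with a short sketch on plain Python lists. It is an illustration only, under stated assumptions: a dense `matvec` helper stands in for the sparse GPU kernels, the priority vector `w` is passed explicitly, the initial target is masked by `g`, and the update takes the element-wise maximum with the previous `t` so that nodes already in the set are never dropped (a choice made for this sketch). The three-node game at the end is hypothetical.

```python
def matvec(A, x):
    """Dense matrix-vector product on plain lists (a stand-in for a GPU kernel)."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def attr_step(A, t, q, g, o):
    """One update of t, following the loop body of Algorithm 2."""
    d = matvec(A, g)  # out-degree of every node inside the subgame g
    v = matvec(A, t)  # number of edges from every node into the current set t
    return [
        g[i] * (int(o[i] == q) * int(v[i] > 0) + int(o[i] == 1 - q) * int(v[i] == d[i]))
        for i in range(len(t))
    ]


def attr(A, t, q, g, o):
    """Iterate attr_step until t stops changing."""
    while True:
        step = attr_step(A, t, q, g, o)
        new_t = [max(a, b) for a, b in zip(t, step)]  # assumption: keep attracted nodes
        if new_t == t:
            return t
        t = new_t


def mat_zielonka(A, g, o, w):
    """Return 0/1 vectors (W0, W1) for the subgame selected by g."""
    n = len(g)
    if sum(g) == 0:
        return [0] * n, [0] * n
    m = max(w[i] for i in range(n) if g[i])  # highest priority still in the game
    q = m % 2
    t = attr(A, [g[i] * int(w[i] == m) for i in range(n)], q, g, o)
    sub = mat_zielonka(A, [a - b for a, b in zip(g, t)], o, w)
    win = [None, None]
    if sum(sub[1 - q]) == 0:
        win[q] = [a + b for a, b in zip(t, sub[q])]
        win[1 - q] = [0] * n
    else:
        t = attr(A, sub[1 - q], 1 - q, g, o)
        sub = mat_zielonka(A, [a - b for a, b in zip(g, t)], o, w)
        win[q] = sub[q]
        win[1 - q] = [a + b for a, b in zip(sub[1 - q], t)]
    return win[0], win[1]


# The worked example from the text: one attractor step for Player 0.
A = [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 1, 0, 0, 0], [1, 0, 1, 0, 0], [0, 1, 0, 0, 0]]
o = [0, 0, 1, 1, 0]
print(attr_step(A, [1, 1, 0, 0, 0], 0, [1] * 5, o))  # [1, 1, 1, 0, 1]

# A hypothetical three-node game, used only to exercise the recursion.
A3 = [[0, 1, 0], [1, 0, 1], [0, 0, 1]]
print(mat_zielonka(A3, [1, 1, 1], [0, 1, 1], [2, 1, 1]))  # ([0, 0, 0], [1, 1, 1])
```

The first printed vector matches the updated $t$ of the worked example; iterating `attr` once more also attracts the fourth node, since both of its successors are then in the set.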


The correctness of the algorithm remains unchanged from that of ZRA since 
our encoding into matrix operations is functional. Less clear is whether our
algorithm retains ZRA’s complexity, since using a functional mapping does
not necessarily imply that the encoding (our representation) has the complexity
of the encoded instance (i.e., the original problem). We study this question next.
3.1 Complexity


Using the algorithms defined before, we derive a function $R(d,n)$ that bounds
the maximum number of recursive calls to ZRA, given $d$ distinct
priorities and $n$ nodes: $R(d,n) = 1 + R(d-1,n-1) + R(d,n-1)$. The 1 is the
original call; the first recursive call is made with at least the vertex with the largest
priority removed, and the second is made with at least one vertex removed.
Hence the construction above. There are two base cases: $R(d,0) = R(0,n) = 1$.
Firstly, we observe that, based on the algorithms herein defined, we get:


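\[
\begin{aligned}
R(d,n) &= 1 + R(d-1,n-1) + R(d,n-1) \\
       &= (n+1) + \sum_{i=1}^{n} R(d-1,n-i)
\end{aligned}
\]
Moreover, $R(d,n)$ is then given by $f(d,n) = 2\sum_{j=0}^{d}\binom{n}{j} - 1$. For the base
case, when $d = 1$, we note that $R(1,n) = (n+1) + \sum_{i=1}^{n} R(0,n-i) = 2n+1$
and $f(1,n) = 2\sum_{j=0}^{1}\binom{n}{j} - 1 = 2(n+1) - 1 = 2n+1 = R(1,n)$, as required, for
all $n$. For the inductive case, assume that $R(d,n) = f(d,n)$, for $d = k$ and all $n$.

\[
\begin{aligned}
R(k+1,n) &= (n+1) + \sum_{i=1}^{n} R(k,n-i) \\
         &= (n+1) + \sum_{i=1}^{n} \Big[ 2\sum_{j=0}^{k}\binom{n-i}{j} - 1 \Big] \\
         &= 1 + 2\sum_{i=1}^{n}\sum_{j=0}^{k}\binom{n-i}{j}
          = 2\sum_{j=0}^{k+1}\binom{n}{j} - 1 = f(k+1,n)
\end{aligned}
\]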


Hence, the statement is true for the base case $d = 1$ and all $n$, while the
inductive case $d = k$ implies $d = k + 1$. Thus, by induction, $R(d,n) = f(d,n)$
for $d \geq 1$ and all $n$. We now observe that the worst-case number of calls occurs,
as expected, at $d = n$, where $R(n,n) = 2^{n+1} - 1$. Note that
a single call to MatZielonka has time complexity $O(n^2)$ (dominated by the
complexity of calls to the matrix-based Attr subroutine¹) and space complexity
$O(n)$, delivering worst-case complexities of $O(n^2 \cdot 2^n)$ time and $O(n \cdot 2^n)$ space.
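
The closed form can be checked numerically against the recurrence. The following sketch evaluates $R(d,n)$ directly from its definition and compares it with $f(d,n)$ on a small range of values; the range itself is an arbitrary choice for illustration.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def R(d, n):
    """Bound on the number of recursive calls, straight from the recurrence."""
    if d == 0 or n == 0:
        return 1
    return 1 + R(d - 1, n - 1) + R(d, n - 1)


def f(d, n):
    """Closed form: 2 * sum_{j=0}^{d} C(n, j) - 1."""
    return 2 * sum(comb(n, j) for j in range(d + 1)) - 1


# The two agree on every pair in this (arbitrarily chosen) range.
assert all(R(d, n) == f(d, n) for d in range(0, 10) for n in range(0, 15))
print(R(10, 10), 2 ** 11 - 1)  # both are 2047, the worst case at d = n
```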

This result, negative in theory, is consistent with the worst-case complexity
of ZRA, which indicates that our matrix-based encoding retains the same
complexity properties of the original algorithm. More interestingly,
the quasi-polynomial extension of ZRA by Parys [16], later improved
by Lehtinen et al. [13], can also be tackled with our approach while retaining the
quasi-polynomial complexity. However, a matrix-based extension of the latter
algorithm was not evaluated. Thus, its practical usefulness is yet to be studied.


1 In practice, this is dominated by the complexity of performing matrix multiplication
operations, which is just slightly larger than $O(n^2)$ and happens to be a vibrant topic
of research recently due to improvements made through the use of deep learning.
4 Implementation and evaluation


Several factors influence the practical performance of a computational solution to
a problem: for instance, (1) the algorithm used to solve the problem, (2) the programming
language used to implement the solution, (3) the concrete data structures
used to represent it, and (4) the hardware where the solution is deployed. Our
solution tries to optimise (1) to (4) using both lessons learnt from previous research
and properties of our own matrix-based approach. Details are given later, but in
short, in this section, five parity game solvers are implemented and evaluated²:


I1 our basic matrix-based approach, presented in the previous section;
I2 its parallel implementation for deployment using GPU technologies;
I3 the improved implementation of the attractor function of ZRA in [6];
I4 the highly optimised C++ implementation of ZRA presented in [7];
I5 the unoptimised version of the above algorithm, also in [7].


Apart from (2), the five implementations above (I1-I5) allow us to carry out
a comprehensive evaluation of our approach, both against different versions
of our own work and against previous research. The only aspect that all the
solutions we present in this section have in common is the programming language
used for implementation, which is C++, at present the language offering the most
efficient practical implementation of parity games solutions; cf. [9,17,6,7]. We
first present the characteristics of our matrix-based approach, deployed both as a
sequential algorithm and as a parallelised procedure. After that, we describe
key features of the solutions originally developed elsewhere, and continue with
the results of the evaluation using different types of parity games.


Matrix-based approach. Whilst it is important to gain performance from
parallelisable operations, it is equally important to avoid the loss of performance
from executing inefficient or slow operations. Specific algorithmic design choices,
such as maintaining a vector $g$ to track nodes that are in or out of the graph,
are made to avoid otherwise necessary operations such as copying the adjacency
matrix, which would be slow, especially when solving very large games.

Additionally, all values in vectors and matrices are stored as single precision
floating point values in practice. This is due to the software limitations of the
Compute Unified Device Architecture (CUDA) [15] library, which are likely limitations
of the underlying hardware itself. In particular, this limits the maximum
out-degree of a node to $2^{24}$, where the exponent corresponds to the number of bits in the
mantissa of a single precision floating point number (23), plus one. Beyond this
limit, the accuracy of the values computed in operations such as computing the
out-degree of a node with $Ag$ would no longer be guaranteed, along
with the correctness of the algorithm. We note that this limitation may be overcome
by splitting a single node into multiple nodes, thus curbing the maximum
out-degree to an acceptable range. We do not do this for these experiments as
this transformation has unknown impacts on the performance of the algorithm.
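
The $2^{24}$ limit can be observed directly by rounding integers to single precision. The sketch below uses Python's standard `struct` module to emulate a 32-bit float on the host; the helper name `to_float32` is hypothetical, and the example is independent of CUDA.

```python
import struct


def to_float32(x):
    """Round a number to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]


limit = 2 ** 24
print(to_float32(limit - 1) == limit - 1)  # True: still exactly representable
print(to_float32(limit + 1) == limit + 1)  # False: rounded back to 2**24
```

A count such as an out-degree that exceeds this value can therefore no longer be compared exactly, which is why the test $v = d$ in the attractor computation would become unreliable.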


2 All files (implementations, experiments, input games, etc.) can be found in [1].
3 The description here applies to the first two solutions described above. 
Algorithm 4 Attr(A, t, q, g, o)

  while $\|t \neq t'\|_1 \neq 0$ do
    for $i \in (1..3)$ do
      ...
    end for
  end while
  return $t$


The invocation of functions that run on the GPU (known as kernels) has
an overhead, whose duration varies somewhat between devices. As
a consequence, tuning for a particular problem depends on the functions being
executed and the GPUs themselves. Thus, there are periods where the device is
idle, and this is a result of the overheads. Also note that in practice, it is usually
faster to perform multiple iterations of the attractor computation per loop check, as performing
an iteration when the full attractor set has already been computed does not alter
the results (Algorithm 4). This is because queueing multiple kernel invocations
has the same overhead as calling one kernel alone. The main difference between
our sequential and parallel implementations of the matrix-based method is the 
function computing attractor sets, which is as in Algorithm 2 in the sequential 
case, and as in Algorithm 4 in the parallel case. The code in ... is the same 
in both implementations, and the key difference is that we set the execution of 
the parallel implementation to make 3 kernel invocations per execution of the 
attractor function — which in lucky cases may require only 1 kernel invocation, 
while in unlucky cases may require more than 3 kernel invocations, increasing 
overheads; for our problem, we found that 3 kernel invocations was appropriate. 


We find that there is another possible point of optimisation, as the time taken
for the attractor computation is approximately $c\,t_e + n\,t_o$, where $c$
is the number of attractor computations (the inside section of the for loop), $n$ is
the number of times the outer while loop runs, $t_e$ is the time to run one iteration of the for loop,
and $t_o$ is the overhead incurred by switching execution from device (GPU)
to host (CPU) as the condition is checked in the while loop. Ideally, $c = C + 1$ and
$n = 1$, where $C$ is the (unknown) number of attractor computations required.
Our implementation loops the inner for loop an arbitrary constant number of
times (3 times here). As such, $C + 1 \leq c \leq C + 3$, and $n = \lceil c/3 \rceil$.
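
The trade-off in this cost model can be explored with a few lines of arithmetic. The sketch below computes $c$ and $n$ for a given batch size and evaluates $c\,t_e + n\,t_o$; the function name `batched_cost` and the default timings are hypothetical values chosen only to illustrate the model, not measurements.

```python
from math import ceil


def batched_cost(C, batch=3, t_e=1.0, t_o=5.0):
    """Return (c, n, time) when `batch` attractor steps run per loop-condition check.

    C is the number of steps that actually change t; one extra step is needed
    to observe that nothing changed. t_e and t_o are made-up unit costs.
    """
    n = ceil((C + 1) / batch)  # outer while iterations (host/device switches)
    c = n * batch              # inner attractor computations actually executed
    return c, n, c * t_e + n * t_o


for C in (1, 7, 20):
    print(C, batched_cost(C, batch=1), batched_cost(C, batch=3))
```

With a switching overhead that is large relative to a single step, the batched variant pays for at most two redundant steps but saves most of the host/device round trips.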


Importantly, the efficient parallelisation of the algorithm on
the GPU requires us to select the ‘Naive attractor’ implementation as the underlying
algorithm (Algorithm 2) to be parallelised (leading to Algorithm 4) rather
than the ‘Improved attractor’ implementation in [6]. The concepts of ‘Naive’ and
‘Improved’ attractors are presented by Arcucci et al. in [6]. In short, the ‘Naive’
attractor loops over each node and checks if it can be included in the attractor
set, and repeats this until no further nodes can be added. The ‘Improved’
attractor starts from the original attractor set, performing backpropagation on
its inbound edges to find other nodes that may be included in the set.
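
For contrast with the naive loop, a backward-propagation attractor can be sketched as follows. This is the standard worklist formulation with per-node counters, written here as an illustration of the idea; it is not claimed to be the exact implementation of [6], and the dictionary encoding (`succ`, `owner`) is an assumption of this sketch. Duplicate edges are assumed not to occur.

```python
from collections import deque


def improved_attractor(succ, owner, q, target):
    """Attractor of `target` for player q, propagating backwards along inbound edges."""
    pred = {u: [] for u in succ}
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)
    # For opponent nodes: number of successors not yet known to be attracted.
    remaining = {u: len(vs) for u, vs in succ.items()}
    attr = set(target)
    queue = deque(sorted(target))
    while queue:
        v = queue.popleft()
        for u in pred[v]:
            if u in attr:
                continue
            if owner[u] == q:
                attr.add(u)  # player q can choose the edge u -> v
                queue.append(u)
            else:
                remaining[u] -= 1
                if remaining[u] == 0:  # every move of the opponent leads into attr
                    attr.add(u)
                    queue.append(u)
    return attr


# The five-node example graph used for the matrix-based attractor.
succ = {0: [0], 1: [0], 2: [0, 1], 3: [0, 2], 4: [1]}
owner = {0: 0, 1: 0, 2: 1, 3: 1, 4: 0}
print(improved_attractor(succ, owner, 0, {0, 1}))  # all five nodes
```

Each edge is inspected at most once, but the worklist is inherently sequential, which is consistent with the choice of the naive variant for the GPU.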
GPU deployment. Our GPU implementation works by parallelising the “attract”
operation.⁴ The sequential version may be executed as follows:


— (Loop 1) While attracting new nodes... 
— (Loop 2) For each node, check if it can be included in the attractor set. 


And the runtime operations may look like: 


— While attracting new nodes...
  • Can node 1 be included in the attractor set?
  • ...
  • Can node N be included in the attractor set?
— If attracted new nodes, repeat loop. Else break.


Performance is gained by efficiently parallelising the inner loop on
the GPU. Additional specifics include the following GPU deployment features.
When asking “Can node X be included ...?”, the computation taking place is:


— Let $J$ be the set of nodes in the current attractor set.
— Let $K$ be the set of nodes that $X$ can move to.
— If $X$ is on the “friendly” team, and $K \cap J \neq \emptyset$, then $J \leftarrow J \cup \{X\}$.
— If $X$ is on the “enemy” team, and $K \subseteq J$, then $J \leftarrow J \cup \{X\}$.


Key to our approach is that these operations are efficiently parallelised by
means of matrix multiplication operations on the GPU. This is done as follows:


— Compute $t = A\mathbf{1}$. Hence, $t_i$ is the number of nodes node $i$ can move to.
— Let $j$ be a vector of size $N$ (where $N$ is the size of the parity game), such
that $j_i = 1$ if and only if node $i$ is in the current attractor set, and $j_i = 0$ otherwise.
— Let $A$ be an adjacency matrix (usually, a sparse matrix) of the parity game.
— Compute the vector $k = Aj$. Hence, the value $k_i$ in the vector is the number
of nodes node $i$ can move to and that are in the current attractor set.
— Then, for each node $i$, if it is on the friendly team, and $k_i \neq 0$, then $j_i = 1$;
otherwise, if it is on the enemy team, and $k_i = t_i$, then $j_i = 1$.


Note we convert the previous logic on sets to suit the new form using vectors: 
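
The correspondence between the set-based test and its vector form can be sketched side by side. Both functions below perform one pass over all nodes; the names `step_sets` and `step_vectors`, and the list/dictionary encodings, are assumptions made for this illustration, and every node is assumed to have at least one successor.

```python
def step_sets(succ, owner, q, J):
    """One pass of the set-based test: K is the successor set of each node X."""
    new_J = set(J)
    for X, K in succ.items():
        K = set(K)
        if owner[X] == q and K & J:       # friendly: some move into J
            new_J.add(X)
        elif owner[X] != q and K <= J:    # enemy: all moves lead into J
            new_J.add(X)
    return new_J


def step_vectors(A, owner, q, j):
    """The same pass on vectors: t = A·1 and k = A·j replace the set operations."""
    n = len(j)
    t = [sum(row) for row in A]                                  # out-degrees
    k = [sum(A[i][m] * j[m] for m in range(n)) for i in range(n)]  # moves into the set
    out = list(j)
    for i in range(n):
        if owner[i] == q and k[i] != 0:
            out[i] = 1
        elif owner[i] != q and k[i] == t[i]:
            out[i] = 1
    return out


A = [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 1, 0, 0, 0], [1, 0, 1, 0, 0], [0, 1, 0, 0, 0]]
succ = {i: [m for m, a in enumerate(row) if a] for i, row in enumerate(A)}
owner = [0, 0, 1, 1, 0]
print(sorted(step_sets(succ, owner, 0, {0, 1})))   # [0, 1, 2, 4]
print(step_vectors(A, owner, 0, [1, 1, 0, 0, 0]))  # [1, 1, 1, 0, 1]
```

The test $K \cap J \neq \emptyset$ becomes $k_i \neq 0$, and $K \subseteq J$ becomes $k_i = t_i$, since $k_i$ counts exactly the successors of node $i$ that lie in $J$.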


Improved attractor implementation by Arcucci et al. [6]. The third parity
game solver we evaluate is a custom C++ implementation of ZRA using
the ‘Improved attractor’ algorithm in [6], originally implemented in Java there.


ZRA implementations in Oink [7]. The fourth and fifth implementations
we evaluate and compare against are the most highly optimised implementation
of ZRA developed in [7], and its unoptimised version (without pre-processing
routines). We include the unoptimised version since our matrix-based (‘Naive’)
implementation is not optimised in terms of the pre-processing routines used for
implementation. These solvers in Oink are referred to as zlk and uzlk in [7].
We note that the parallel implementation of this algorithm is not included since
in [7] it is shown that it is usually outperformed by zlk, which we include here.


4 A very different approach, leading to a very different GPU deployment, is taken in [10].
4.1 Evaluation


The implementations evaluated in this paper were tested on a wide repository of 
parity games, and against state-of-the-art parity game solvers in the literature. 
The games used for performance evaluation include the suite by Keiren [12] (of 
games representing model checking and equivalence checking problems) and an 
additional set of variably sized random games generated by PGSolver [9].⁵

We evaluate the performance in terms of the solve time of each solver on each
game. As is common practice when evaluating different solvers for parity games,
the overheads incurred due to startup and game loading are not included; this
is done in order to obtain numbers that estimate only the running time of the
algorithms. With the same aim, we ensured that at most one solver is running at
any time, with CPU utilisation not exceeding one core. Finally, in order to
allow for a fair comparison of running times only, rather than combining such
results with the robustness of the algorithms, we measured the time taken to
solve an instance only if all implementations successfully compute a solution.
This allows for a fairer comparison with respect to pure runtime performance,
because failing a game usually implies a disproportionately (and arbitrarily)
high runtime. Such failures include timeouts (at 5 minutes) or being unable to
load the game, sometimes due to factors having little to do with the running
time of the algorithms. Our experiments were conducted on the Google Cloud
Platform (GCP) using a T4 n1-highmem-2.⁶
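The averaging rule described above can be sketched as follows. This is only an
illustration of the rule, not the measurement harness used for the experiments;
the type Result, the function name mean_time_on_common and the use of
std::optional to mark a failed run are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Solve time in milliseconds, or std::nullopt if the solver failed on the
// game (timeout or unable to load it). Hypothetical type for illustration.
using Result = std::optional<double>;

// results[s][g] is the outcome of solver s on game g.
// Returns, per solver, the mean time over the games that every solver solved.
std::vector<double> mean_time_on_common(
        const std::vector<std::vector<Result>>& results) {
    assert(!results.empty());
    const std::size_t games = results.front().size();
    std::vector<double> sums(results.size(), 0.0);
    std::size_t common = 0;
    for (std::size_t g = 0; g < games; ++g) {
        bool all_solved = true;
        for (const auto& row : results)
            all_solved = all_solved && row[g].has_value();
        if (!all_solved) continue;  // game is excluded for every solver
        ++common;
        for (std::size_t s = 0; s < results.size(); ++s)
            sums[s] += *results[s][g];
    }
    for (double& x : sums) x = common ? x / common : 0.0;
    return sums;
}
```

A game on which any solver fails therefore contributes to no solver's average,
which keeps the reported times comparable across implementations.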


Profile of the input parity games. Our study includes more than 2000 parity
games, with sizes ranging from only a few dozen states to games with millions
of states. Both the nodes’ out-degrees and the number of distinct priorities
also cover a wide range. However, both random games and structured ones
(model checking and equivalence checking) are typically represented by sparse
graphs, a feature that we leverage for implementation purposes.


5 Analysis of results 


As can be seen from Tables 1, 2, and 3, we evaluate the five main implementations,
all of them following the ZRA philosophy, using two types of parity
games: structured and random. Both types of benchmarks are as in [7] and [6],
arguably the two best implementations of ZRA. The focus of this evaluation
is to understand the usefulness and scalability of the ‘GPU matrix’ algorithm,
which is the one that most cleanly embodies our working hypothesis, namely, that
the combination of a matrix-based representation of ZRA and the use of modern
GPU technologies can outperform the state of the art in the design of algorithms
for parity games (a hypothesis for which we provide strong evidence here).


5 These random games were generated using parameters that are identical to those of
the random games in the ‘PGSolver’ collection in the suite of benchmarks by Keiren.

6 In order to compare performance on different hardware (GPU) architectures, we use
a different technology for experiments presented in a forthcoming section.

                           Model checking   Equiv checking   Random games
Implementation             Time    P/F      Time    P/F      Time    P/F
GPU matrix                   94    313/0     332    209/7      20    1750/0
Naive (matrix) attractor    566    313/0    2190    216/0      88    1750/0
Improved attractor          212    313/0    1310    216/0     113    1750/0
Oink’s zlk                  143    313/0     578    216/0      39    1750/0
Oink’s uzlk                 150    313/0     917    216/0      69    1750/0


Table 1: Times are in milliseconds (ms) representing the average time taken to 
solve games that all implementations passed (i.e., if any implementation fails 
to solve a game, the game is excluded from the time average of all five solvers, 
including an additional GPU implementation on an RTX2060S, presented later). 
Failures occur with a small number of large equivalence checking games only. 
Failures include a few timeouts (at 5 mins), and usually being unable to load the 
game in memory due to hardware limitations posed by the GPU architectures. 
Columns P/F show the number of games passed/failed for every type of game. 


                           Model checking   Equiv checking   Random games
Implementation             Time    P/F      Time    P/F      Time    P/F
GPU matrix                  814    33/0     2612    29/7      283    50/0
Naive (matrix) attractor   4565    33/0    17610    36/0     1059    50/0
Improved attractor         1832    33/0    10411    36/0     1446    50/0
Oink’s zlk                 1263    33/0     4568    36/0      547    50/0
Oink’s uzlk                1316    33/0     7332    36/0      952    50/0


Table 2: Results in this table are formatted as in Table 1. In this table, we report 
the performance (average time in milliseconds taken to solve a single game) for 
the 5 algorithms on large (>1M nodes) parity games only. 


                           Model checking   Equiv checking   Random games
Implementation             Time    P/F      Time    P/F      Time    P/F
GPU matrix                    9    280/0      22    180/0      12    1700/0
Naive (matrix) attractor     95    280/0     172    180/0      59    1700/0
Improved attractor           21    280/0     119    180/0      74    1700/0
Oink’s zlk                   11    280/0      56    180/0      24    1700/0
Oink’s uzlk                  13    280/0      77    180/0      43    1700/0


                           Nodes
Implementation              4K   8K  12K  16K  20K  40K  80K  320K  640K
GPU matrix                   1    1    2    2    2    5   11    43    78
Naive (matrix) attractor     1    1    2    4    5   17   37   208   469
Improved attractor           1    1    3    5    7   19   45   264   557
Oink’s zlk                   1    1    2    3    4    7   15    76   186
Oink’s uzlk                  1    2    3    4    5   11   25   142   354


Table 3: Results in this table are formatted as in Table 1. In this table, we 
report the performance (average time in milliseconds taken to solve a single 
game) for the 5 algorithms on “small” (<1M nodes) parity games only: results 
for structured and random games appear in the top table and for random games 
(detailed) at the bottom. In the bottom table, there are 200 games per column, 
apart from column 640K which has 100 games; there are no failures. 
The results above also show that going from the sequential version of our
approach, ‘Naive (matrix) attractor’, to its parallel implementation using GPU
technologies yields significant improvements. These two main “internal” results
are then compared with the state of the art in the algorithmic design of
solutions based on ZRA, namely, the improved attractor in [6] and the
highly optimised procedure zlk in Oink [7], which even outperforms its own
parallel implementation; cf. [7]. Finally, the unoptimised version of this
procedure in Oink, uzlk, is also included simply because our matrix-based
procedure does not contain any of the pre-processing routines that differentiate
zlk from uzlk. Thus, in a way, uzlk provides results for a somewhat fairer
comparison.


GPU matrix vs Naive (matrix) attractor. Results in all tables show that the
parallel implementation using GPU technologies outperforms its own sequential
implementation (‘Naive matrix attractor’) by a wide margin: with some
exceptions, it is between about 5 times faster (e.g., model checking of large
games) and more than 10 times faster (e.g., model checking of small games).
This, we believe, is due to the fact that the bigger the input instances to be
analysed, the more any overheads associated with running the procedure in
parallel are compensated later on. A trend going in that direction can be
observed in detail when comparing the performance of these two algorithms over
small random games. In any case, our parallel matrix-based approach is always
at least as good as its sequential implementation.
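The factors quoted in this comparison are plain ratios of the average times in
the tables. A minimal sketch of that arithmetic, with speedup being a
hypothetical helper name introduced only for this example:

```cpp
#include <cassert>

// How many times faster a solver is than a baseline, given the average
// solve times (in milliseconds) of both on the same set of games.
double speedup(double baseline_ms, double solver_ms) {
    assert(solver_ms > 0.0);
    return baseline_ms / solver_ms;
}
```

For model checking games, Table 2 (large games) gives speedup(4565, 814), about
5.6, and Table 3 (small games) gives speedup(95, 9), about 10.6, for ‘GPU
matrix’ over ‘Naive (matrix) attractor’.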


GPU matrix vs Improved attractor. The results show that the parallel matrix-based
approach can outperform the improved attractor procedure by Arcucci et
al. [6] by a factor of 2 to 7, depending on the type of game being solved,
with the best results obtained when solving random games, whether large or
small. However, the sequential version of ‘GPU matrix’, that is, the Naive
implementation, is usually about twice as slow as the improved attractor
implementation on structured games. In contrast, even the (sequential) Naive
implementation of the matrix-based method outperforms the improved attractor
procedure over random games, by about 30% overall in that case. When looking at
all the tables of results together, one can see that this is in fact an
indication that the improved attractor approach performs somewhat poorly over
random graphs, at least when compared to its performance over structured games.


GPU matrix vs Oink. Even though the GPU matrix-based implementation outperforms
Oink’s zlk, it usually does so only by a factor of 1.5 to 2.0, with the GPU
implementation performing more efficiently over (large) random games than over
structured ones. This result actually speaks very highly of the optimised
sequential implementation of ZRA. Moreover, as shown in [7], zlk performs even
better than its own parallel implementation (called zlk-8 in [7]) when solving
model checking parity games (by a very small margin) and when solving random
games, where it is nearly twice as fast; cf. Table 3 of [7]. Only when solving
equivalence checking parity games does zlk-8 outperform zlk, and then only by a
margin of about 13%. In contrast, the GPU implementation here outperforms zlk
by a margin of more than 70%, and is even twice as fast when solving small
equivalence checking games.

However, as we can see from the tables, the GPU matrix-based implementation
has some failures (timeouts or failure to load the game into memory, mainly
due to game size), while zlk, like the improved attractor method, never fails
on the considered set of benchmarks. This indicates that in this particular
case, there may be a choice to be made between some potentially marginal gain
in efficiency and the greater reliability offered by zlk. On the other hand,
zlk clearly outperforms the sequential (Naive) implementation of the
matrix-based approach, being from twice as fast when solving random games to
about four times as fast when solving structured games. Regarding performance
against Oink’s uzlk, all the analyses above remain similar, except that a better
factor is usually obtained in favour of the GPU matrix-based approach.


Improved attractor vs Oink’s zlk. Although both procedures were developed in
earlier work, we would like to comment on their comparative performance, for
the sake of completeness of the analysis. As can be seen from our
results, both offer the same reliability as they do not fail to solve any instance.
Regarding runtime efficiency, we can observe that, on average, Oink’s zlk
implementation tends to be 1.5 to 3.0 times faster than the improved attractor
method, with the worst/best comparative performance being observed when solving
model checking/random parity game instances. This makes zlk perhaps
the most efficient sequential implementation of ZRA currently available in the
literature, outperformed only when a parallel approach is considered.


6 Special cases 


In this section, we analyse in more detail two special cases of our results:
performance when solving large parity games and performance on random games.


6.1 Solving large parity games 


For the purposes of this section, a large parity game is a game with more than
1 million nodes. Our results show that for games that are not large (Table 3),
all solvers may be regarded as running efficiently from a human perspective,
with some random games with more than 500K nodes being solved in about
half a second by the slowest implementation on random games (the improved
attractor implementation). In most other instances, solutions may be obtained
in just a few milliseconds. For instance, model checking parity games in the suite
of benchmarks can be solved in less than 0.1 minutes by any studied solver, and
even in less than 10 milliseconds on average using the parallel GPU matrix-based
approach, with Oink’s implementation taking virtually the same time (just
a little more than 10 milliseconds on average). Hence, the real challenge when
solving parity games in practice is solving large parity games, where the relative
performance of different solvers is much better exposed (Table 2).

Our results show (Tables 1 and 2) that, despite the raw data differing by
about one order of magnitude (a factor of roughly 9), nearly the same relative
performance is obtained when looking at performance over all games as when
looking at performance over large games only, which account for no more than
15% of the games for equivalence checking games, 10% for model checking games
and less than 5% for random games. This result indicates that in order to
evaluate the performance of parity game solvers in practice, one should focus
on large games only. As the data shows, in that case the parallel GPU
matrix-based approach outperforms the second-best technique by a factor of
approximately 1.5 to 2.0, and its own sequential implementation by a factor of
4 to 5, in each case depending on the type of parity game under consideration.
The analysis holds across all solvers.


6.2 Solving random parity games 


Random parity games are a common benchmark for parity game solvers, being
the focus of the study in [6]. Our detailed experiments on random parity
games show that the parallel GPU implementation of the matrix-based approach
is comparable to the parallel implementation of the improved attractor
in [6] (see Table 3 there), in the sense that a similar relative gain
in performance is achieved overall, performing about 3.5-4.0 times faster over
random games of up to 20K nodes. The gain in performance increases in our case
when considering larger random graphs, perhaps indicating that our approach
may be more scalable in terms of running time; however, in [6], only results on
random games of up to 20K nodes are presented. We note that, in this case,
merely by changing the programming language of choice (Java in [6] and C++
here), performance is improved from games of size 20K being solved in more than
5 seconds to the same type of games being solved in just 7 ms on average here.


7 Alternative implementations 


In this section, we explore two alternative implementations, one focused on a
change of programming environment and another one based on a change of
computer architecture. Our results show that while the former is clearly
outperformed by the original C++ implementation, the latter shows that even
better performance than already reported can be achieved when using other GPU
technologies.


A MATLAB implementation. Given its facilities for performing matrix operations,
we investigated a MATLAB implementation of our matrix-based approach to
understand if it could perform better than our original C++ implementation. The
results were negative. The MATLAB implementation of our approach, although
simple, performed significantly worse than other methods, including our own
using C++. A summary of the results, which require little discussion, can be
found in Table 4.


Using a different GPU technology. We conducted experiments using the exact
same implementation of the GPU matrix solver (run on GCP) on a different
GPU architecture, namely, on an RTX2060 Super (Ryzen 5 3600). We found
that by simply changing to this alternative hardware specification, the results
on structured games were significantly better, while random games were solved
somewhat more slowly, as shown in Table 5.


                           Model checking   Equiv checking   Random games
Implementation             Time    P/F      Time    P/F      Time    P/F
GPU matrix                   94    313/0     332    209/7      20    1750/0
Naive (matrix) attractor    566    313/0    2190    216/0      88    1750/0
MATLAB matrix              2462    311/2    2338    198/18   3496    1750/0


Table 4: Results in this table are formatted as in Table 1. We report results on all
games, and in each case, independently, remove the time of unsolved instances.


                           Model checking   Equiv checking   Random games
Implementation             Time    P/F      Time    P/F      Time    P/F
GPU matrix (RTX2060S)        63    313/0     203    205/11     25    1750/0
GPU matrix                   94    313/0     332    209/7      20    1750/0


Table 5: Results in this table are formatted as in Table 1. We report results on
all games, which show an improvement of a 1.5x factor for structured games,
while performing approximately 25% slower over random parity games.


8 Concluding remarks and related work 


We have shown that a new method for solving parity games using a matrix-based
approach can outperform the state-of-the-art techniques, both sequential and
parallel, currently available. As such, our results become a new point of
comparison when evaluating modern solvers for parity games. Previous research
[7,6,17,9] has shown that ZRA is potentially the best performing algorithm to
solve parity games in practice, and here we provide more evidence that this is
indeed the case. We also give evidence that C++ implementations for this task
are hardly ever outperformed in practice. Finally, we also show that choosing
the right computer architecture is key to achieving optimal performance, and in
particular that in the case of modern GPU technologies, such a choice can make
a significant difference in practice. In our study, it led to the development
of what is, as of today, the most efficient parallel implementation/solver for
parity games.
Abstract. Various techniques have been proposed to accelerate explicit-state
model checking with GPUs, but none address the compact storage
of states, or if they do, at the cost of losing completeness of the checking 
procedure. We investigate how to implement a tree database to store 
states as binary trees in GPU memory. We present fine-grained parallel 
algorithms to find and store trees, experiment with a number of GPU- 
specific configurations, and propose a novel hashing technique, called 
Cleary-Cuckoo hashing, which enables the use of Cleary compression on 
GPUs. We are the first to assess the effectiveness of using a tree database, 
and Cleary compression, on GPUs. Experiments show processing speeds 
of up to 131 million states per second. 


Keywords: Explicit state space exploration, finite-state machines, GPU. 


1 Introduction 


Major advances in computation increasingly need to be obtained via parallel
software, as Moore’s Law is ending [30]. In the last decade, GPUs have been
successfully applied to accelerate various computations relevant for model
checking, such as probability computations for probabilistic model checking
[8, 25, 48], counter-example construction [54], state space decomposition [52],
parameter synthesis for stochastic systems [12], and SAT solving
[34-38, 40, 43, 56, 57]. VoxLogicA-GPU applies model checking to analyse
(medical) images [9].

In the earliest work on GPU explicit state space exploration, GPUs performed 
part of the computation, specifically successor generation [18,19] and property 
checking once the state space has been generated [5]. This was promising, but 
the data copying between main and GPU memory and the computations on 
the CPU were detrimental for performance. The first tool that performed the 
entire exploration on a GPU was GPUEXPLORE [33, 50,51, 53]. It was later 
extended to support LTL model checking [49]. A similar exploration engine was 
later proposed in [55]. An approach that applied a GPU to explore the state 
space of PROMELA models, i.e., the models for the SPIN model checker [21], was 
presented in [6]. This was later adapted to the swarm checker GRAPPLE [16], 
which can efficiently explore very large state spaces, but at the cost of losing 
completeness. Finally, the model checker PARAMOC for pushdown systems was 
presented in [46, 47]. 


© The Author(s) 2023 
S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 684—703, 2023. 
https://doi.org/10.1007/978-3-031-30823-9_35 
The above techniques demonstrate the potential for GPU acceleration of state
space exploration and (explicit-state) model checking, being able to accelerate 
those procedures tens to hundreds of times, but they all have serious practical 
limitations. Several limit the size of state vectors to 64 bits [6,55] or the size 
of transition encodings to 64 bits [46,47]. GPUEXPLORE does not efficiently 
support models with variables [50,53]. When adding variables, the amount of 
memory needed rapidly grows, due to the growing input model and inefficient 
state storage. GRAPPLE requires less memory, but uses bitstate hashing. This 
rules out the ability to detect that all reachable states have been explored, which 
is crucial to prove the absence of undesired behaviour. PARAMOC verifies push- 
down systems, but does not support concurrency, and abstracts away data. 


Contributions. We propose how to perform memory-efficient complete state 
space exploration on a GPU for concurrent Finite-State Machines (FSMs) with 
data. To make this possible, we are the first to investigate the storage of binary 
trees in GPU hash tables, propose new algorithms to find and store trees in a 
fine-grained parallel fashion, experiment with a number of GPU-specific config- 
urations, and propose a novel hashing technique called Cleary-Cuckoo hashing, 
which enables the use of Cleary compression [13,15] on GPUs. To achieve this, we 
have to tackle the following challenges: 1) CPU-based algorithms are recursive, 
but GPUs are not suitable for recursion, and 2) accessing GPU global memory, 
in which the hash tables reside, is slow. This work marks an important step
towards practical GPU-accelerated model checking, as it can be extended to
checking functional properties of models with data, and paves the way to
investigating the use of Binary Decision Diagrams [29] for symbolic model checking.

The structure of the paper is as follows. In Section 2, we discuss related 
work on GPU hash tables. Section 3 presents background information on GPU 
programming, and Section 4 contains an overview of the state space exploration 
engine. Section 5 addresses the challenges when designing a GPU tree table, and 
presents our new algorithms. Experimental results are given in Section 6, and in 
Section 7, conclusions and our future work plans are discussed. 


2 Related Work 


An overview of related work on GPU acceleration of model checking is given in
Section 1. In the current section, we focus on hash tables [14] for the GPU. In
explicit state space exploration, states are typically stored in a hash table. Such
a table is often implemented as an array, where the elements represent the hash
table buckets. A recent survey of GPU hash tables [31] identifies that when using
integer data items and unordered insertions and queries, Cuckoo hashing [41]
is (currently) the best option, compared to techniques such as chaining [3] or
Robin Hood hashing [20], and the Cuckoo hashing of [1] is particularly effective.
In Cuckoo hashing, collisions, i.e., situations where a data item e is hashed to
an already occupied bucket, are resolved by evicting the encountered item e’,
storing e, and moving e’ to another bucket. A fixed number m of hash functions
is used to have multiple storage options for each item. Item look-up is
therefore limited to m memory accesses, but storage can lead to chains of
evictions. In [1], it is demonstrated that with four hash functions, a hash
table needs around 1.25N buckets to store N items.¹ Recent research [4] has
demonstrated that using larger buckets, spanning multiple elements, that still
fit in the GPU cache line is beneficial for performance, and increases the
average load factor, i.e., how much the hash table can be filled until an item
cannot be inserted, to 99%. We address this in detail in Section 3. However, in
[4], an older NVIDIA GPU of the VOLTA architecture was used (2017), while more
recent GPUs are supposedly less susceptible to optimisations exploiting the
cache line. In this work, we experimentally assess this for hash table buckets.
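The eviction mechanism of Cuckoo hashing can be illustrated with a small
sequential sketch. It is not the GPU implementation of [1] or of this work: the
class name CuckooSet, the hash multipliers, the bound on the eviction chain and
the convention that key 0 marks an empty bucket are all assumptions chosen for
the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Sequential Cuckoo hash set for non-zero 32-bit keys with M hash functions.
class CuckooSet {
public:
    static constexpr std::size_t M = 4;         // number of hash functions
    static constexpr std::uint32_t EMPTY = 0;   // marks a free bucket

    explicit CuckooSet(std::size_t buckets, std::size_t max_evictions = 64)
        : table_(buckets, EMPTY), max_evictions_(max_evictions) {
        assert(buckets > 0);
    }

    // Look-up inspects at most M buckets.
    bool contains(std::uint32_t key) const {
        for (std::size_t i = 0; i < M; ++i)
            if (table_[slot(key, i)] == key) return true;
        return false;
    }

    // Returns std::nullopt on success. If the eviction chain gets too long,
    // returns the item that was left without a bucket (the caller would then
    // typically rebuild the table with more buckets).
    std::optional<std::uint32_t> insert(std::uint32_t key) {
        assert(key != EMPTY);
        if (contains(key)) return std::nullopt;
        std::size_t fn = 0;
        for (std::size_t step = 0; step <= max_evictions_; ++step) {
            const std::size_t pos = slot(key, fn);
            std::swap(key, table_[pos]);            // store key, pick up old item
            if (key == EMPTY) return std::nullopt;  // bucket was free: done
            fn = (which_fn(key, pos) + 1) % M;      // evicted item tries its next bucket
        }
        return key;
    }

private:
    // Arbitrary odd multipliers, chosen for this example only.
    std::size_t slot(std::uint32_t key, std::size_t i) const {
        static constexpr std::array<std::uint32_t, M> mult = {
            2654435761u, 2246822519u, 3266489917u, 668265263u};
        return (static_cast<std::uint64_t>(key) * mult[i] >> 7) % table_.size();
    }
    // Index of the first hash function that maps key to bucket pos.
    std::size_t which_fn(std::uint32_t key, std::size_t pos) const {
        for (std::size_t i = 0; i < M; ++i)
            if (slot(key, i) == pos) return i;
        return 0;
    }

    std::vector<std::uint32_t> table_;
    std::size_t max_evictions_;
};
```

The bound max_evictions plays the role of the point at which "an item cannot be
inserted" in the definition of the load factor above.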

Besides buckets, we also consider Cuckoo hashing as used in [1,4], but we 
are the first to investigate the storage of binary trees, and the use of Cleary 
compression to store more data in less space. Libraries offering GPU hash tables, 
such as [23], do not offer these capabilities. Furthermore, we are the first to 
investigate the impact of using larger buckets for binary tree storage embedded 
in a state space exploration engine. 

The model checker GPUEXPLORE [11, 50, 53] uses multiple hash functions
to store a state. State evictions are never performed, as each state is stored in 
a sequence of integers, making it not possible to store states atomically. This 
can lead to storing duplicate states, which tends to be worsened when states 
are evicted, making Cuckoo hashing not practical [51]. Besides compact state 
storage, a second benefit of using trees with each node being stored in a single 
integer is that it allows arbitrarily large states to be stored atomically, i.e., a 
state is stored the moment the root of its tree is stored. 

Because we store trees, with the individual nodes referencing each other,
we do not consider alternative storage approaches, such as using a list that
is repeatedly sorted, even though Alcantara et al. identified that using radix
sort [32] is competitive with hashing [1].


3 GPU programming 


CUDA² is a programming interface that enables general purpose programming
for a GPU. It has been developed and maintained by NVIDIA
since 2007. In this work, we use CUDA with C++. Therefore, we use CUDA
terminology when we refer to thread and memory hierarchies.

The left part of Fig. 1 gives an overview of a GPU architecture. For now,
ignore the bold-faced words and the pseudo-code. A GPU consists of a finite
number of streaming multiprocessors (SM), each containing hundreds of cores.
For instance, a Titan RTX, which we used for this work, has 72 SMs containing
together 4,608 cores. A programmer can implement functions, named kernels, to
be executed by a predefined number of GPU threads. Parallelism is achieved by
having these threads work on different parts of the data.


1 This refers to the single-level version of their Cuckoo hashing [1], which we consider
in this work. Their two-level version is more complex and less efficient.
2 https://developer.nvidia.com/cuda-zone.

[Figure 1. Left: GPU architecture, with streaming multiprocessors (SM 1, ...)
and global memory (state storage). Right: pseudo-code of the exploration loop:
  while there are unexplored states:
    - select set of unexplored states S (mark them explored)
    - for all successors s' of all s ∈ S:
        - store s' in the (local) state cache
    - sync. cache with global memory]


Fig. 1: State space exploration on a GPU architecture.

When a kernel is launched, threads are grouped into blocks, usually of a size 
equal to a power of two, often 512 or 1,024. Each block is executed by one SM, but 
an SM can interleave the execution of many blocks. When a block is executed, the 
threads inside are scheduled for execution in smaller groups of 32 threads called 
warps. A warp has a single program counter, i.e., the threads in a warp run in 
lock-step through the program. This concept is referred to as Single Instruction 
Multiple Threads (SIMT): each thread executes the same instructions, but on 
different data. The threads in a warp may also follow diverging program paths, 
leading to a reduction in performance. For instance, if the threads of a warp 
encounter an if C then P1 else P2 construct, and for some, but not all, C 
holds, all threads will step through the instructions of both P1 and P2, but each 
thread only executes the relevant instructions. 
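The cost of divergence can be illustrated with a deliberately simplified model
of lock-step execution. The function warp_steps is hypothetical and counts only
the instruction slots a warp spends on the branch bodies; it ignores scheduling
details of real hardware.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Instruction slots a warp spends on "if C then P1 else P2" when the threads
// run in lock-step. c[i] is the value of C for thread i; len_p1 and len_p2
// are the numbers of instructions in P1 and P2.
std::size_t warp_steps(const std::vector<bool>& c,
                       std::size_t len_p1, std::size_t len_p2) {
    assert(!c.empty());
    const bool some_true =
        std::any_of(c.begin(), c.end(), [](bool b) { return b; });
    const bool some_false =
        std::any_of(c.begin(), c.end(), [](bool b) { return !b; });
    // A branch body is stepped through by the whole warp as soon as at
    // least one thread needs it.
    return (some_true ? len_p1 : 0) + (some_false ? len_p2 : 0);
}
```

If C holds for all 32 threads, only P1 is paid for; as soon as a single thread
disagrees, the warp pays for both P1 and P2.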

GPU threads can use atomic instructions to manipulate data atomically, such 
as a compare-and-swap on 32- and 64-bit integers: ATOMICCAS(addr, compare, 
val) atomically checks whether at address addr, the value compare is stored. If 
so, it is updated to val, otherwise no update is done. The actual value read at 
addr is returned. 
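The return-value convention of compare-and-swap matters when it is used to
claim a hash table bucket. The following host-side sketch mimics it with
std::atomic; the helper names atomic_cas and store, and the use of 0 as the
"empty" value, are assumptions made for this example rather than part of the
CUDA API or of our implementation.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Host-side analogue of a compare-and-swap that returns the value read:
// if addr holds compare, it is updated to val; otherwise nothing changes.
std::uint32_t atomic_cas(std::atomic<std::uint32_t>& addr,
                         std::uint32_t compare, std::uint32_t val) {
    std::uint32_t expected = compare;
    // On failure, compare_exchange_strong writes the value found into expected.
    addr.compare_exchange_strong(expected, val);
    return expected;
}

// Try to store item in a bucket. Succeeds if the bucket was empty or if
// another thread had already stored the same item there.
bool store(std::atomic<std::uint32_t>& bucket, std::uint32_t item,
           std::uint32_t empty = 0) {
    assert(item != empty);
    const std::uint32_t old = atomic_cas(bucket, empty, item);
    return old == empty || old == item;
}
```

Because the value read is returned, a thread learns from a single atomic
operation whether it claimed the bucket, found its own item already present, or
lost the bucket to a different item.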

There are various types of memory on a GPU. The global memory is the 
largest of these, 24 GB in the case of the Titan RTX, and is used to copy data 
between the host (CPU-side) and the device (GPU-side). It can be accessed by 
all GPU threads, and has a high bandwidth, but also a high latency. Having 
many threads executing a kernel helps to hide this latency; the cores can rapidly 
switch contexts to interleave the execution of multiple threads, and whenever 
a thread is waiting for the result of a memory access, the core uses that time 
to execute another thread. Another way to improve memory access times is by 
ensuring that the accesses of a warp are coalesced: if the threads in a warp try to 
fetch a consecutive block of memory in size not larger than the cache line (128 
bytes for a Titan RTX), then the time needed to access that block is the same 
as the time needed to access an individual memory address. 
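Whether the accesses of a warp can be coalesced may be estimated by counting
the cache lines they touch. The sketch below is an approximation under the
assumption that cache lines are aligned to multiples of their size; the
function name cache_lines_touched is introduced here only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Number of distinct cache lines touched by the byte addresses accessed by
// the threads of one warp. A result of 1 means a fully coalesced access.
std::size_t cache_lines_touched(const std::vector<std::uint64_t>& addresses,
                                std::uint64_t line_bytes = 128) {
    assert(line_bytes > 0);
    std::set<std::uint64_t> lines;
    for (std::uint64_t a : addresses) lines.insert(a / line_bytes);
    return lines.size();
}
```

For example, 32 threads reading consecutive 4-byte integers from an aligned
address touch a single 128-byte line, whereas 32 threads reading with a stride
of 128 bytes touch 32 different lines.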

Other types of memory are shared memory and registers. Shared memory is 
fast on-chip memory with a low latency, that can be used as block-local memory; 
the threads of a block can share data with each other via this memory. In a Titan 
RTX, each block can use up to 49,152 bytes of shared memory. Register memory 
is the fastest, and is used to store thread-local data. It is very small, though, 
and allocating too much memory for thread-local variables may result in data 
spilling over into global memory, which can dramatically limit the performance. 
Finally, the threads in a warp can communicate very rapidly with each other
by means of intra-warp instructions. There are various instructions, such as 
SHUFFLE to distribute register data among the threads and BALLOT to distribute 
the results of evaluating a predicate. Since CUDA 9.0, threads can be partitioned 
into cooperative groups. If these groups have a size that completely divides the 
warp size, i.e., it is a power of two smaller than or equal to 32, then the threads 
in a group can use intra-warp instructions among themselves. 

In Section 2, we mentioned the use of buckets in a GPU hash table. When a 
hash table is divided into buckets, each containing 1 < n ≤ 32 elements, that still
fit in the cache line, then cooperative groups of n threads each can be created, 
and the threads in a group can work together for the fetching and updating 
of buckets. This results in more coalesced memory accesses and reduces thread 
divergence. However, it also means that fewer tasks can be performed in parallel, 
and starting with the TURING architecture (2018), which the Titan RTX is built 
on, NVIDIA has been working on making computations less reliant on coalesced 
memory accessing. 


4 GPU state space exploration 


SLCO. For this work, we extended the state space exploration engine of
GPUEXPLORE 2.0 [53] to support models of finite-state concurrent systems written
in the Simple Language of Communicating Objects (SLCO), version 2.0 [44]. An
SLCO model consists of a finite number of FSMs. The FSMs can communicate
via globally shared variables, and each FSM can have its own local variables.
Variables can be of type Bool, Byte and (32-bit) Integer, and there is support
for arrays of these types. We refer with (system) states s, s’, ... to entire
states of the system, and with FSM states $\sigma, \sigma', \ldots$ to the
states of an individual FSM. A system state is essentially a vector, containing
all the information that together defines a state of the system, i.e., the
current states of the FSMs and the values of the variables.

An FSM transition $tr = \sigma \xrightarrow{st} \sigma'$ indicates that the FSM
can change state from $\sigma$ to $\sigma'$ iff the associated statement st is
enabled. A statement is either an assignment, an expression or a composite.
Each can refer to the variables in the scope of the FSM. An assignment is
always enabled, and assigns a value to a variable. An expression is a predicate
that acts as a guard: it is enabled iff it evaluates to true. Finally, a
composite is a finite sequence of statements $st_0; \ldots; st_n$, with $st_0$
being either an expression or an assignment, and $st_1, \ldots, st_n$ being
assignments. A composite is enabled iff its first statement is enabled. A
transition $tr = \sigma \xrightarrow{st} \sigma'$ can be fired if it is
enabled, which results in the FSM atomically moving from state $\sigma$ to
state $\sigma'$, and any assignments of st being executed in the specified
order. When tr is fired while the system is in a state s, then after firing,
the system is in state s’, which is equal to s, apart from the fact that
$\sigma$ has been replaced by $\sigma'$, and the effect of st has been taken
into account. We call s’ a successor of s.
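The enabledness and firing rules above can be sketched in a few lines. This is
an illustrative model only, not code produced by the code generator: the types
SystemState, Statement and Transition and the functions enabled and fire are
hypothetical, and a statement is represented as an optional guard followed by a
list of assignments.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

struct SystemState {
    std::vector<int> fsm_state;  // current state of each FSM
    std::vector<int> vars;       // values of all variables
};

// Covers all three kinds of statement: an expression has a guard and no
// assignments, an assignment has no guard, and a composite may have both.
struct Statement {
    std::function<bool(const SystemState&)> guard;  // empty: always enabled
    std::vector<std::function<void(SystemState&)>> assignments;
};

struct Transition {
    std::size_t fsm;  // index of the FSM the transition belongs to
    int from;         // source FSM state
    int to;           // target FSM state
    Statement st;
};

bool enabled(const Transition& tr, const SystemState& s) {
    assert(tr.fsm < s.fsm_state.size());
    return s.fsm_state[tr.fsm] == tr.from && (!tr.st.guard || tr.st.guard(s));
}

// Returns the successor of s obtained by firing tr, or std::nullopt if tr
// is not enabled in s.
std::optional<SystemState> fire(const Transition& tr, const SystemState& s) {
    if (!enabled(tr, s)) return std::nullopt;
    SystemState next = s;
    next.fsm_state[tr.fsm] = tr.to;                            // move the FSM
    for (const auto& assign : tr.st.assignments) assign(next); // in order
    return next;
}
```

Since fire works on a copy of s, the change of FSM state and all assignments
become visible together, which mirrors the atomic execution of a transition.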
The formal semantics of SLCO defines that each transition is executed atomically,
i.e., cannot be interrupted by the execution of other transitions. The
FSMs execute concurrently, using an interleaving semantics. Finally, the FSMs 
may have non-deterministic behaviour, i.e., at any point of execution, an FSM 
may have several enabled transitions. 


State space exploration. Given an SLCO model with n FSMs, first, CUDA 
functions f_1, ..., f_n are generated, using a new code generator. Each f_i takes as 
input a state s, and produces as output the successors of s that can be reached by 
firing a transition of the i-th FSM that is enabled in s. When the state space is 
generated, each state s can be analysed in parallel by n threads t_1, ..., t_n, where 
each t_i executes f_i to obtain some of the successors of s. 
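As a rough illustration of this division of work, the sketch below defines one successor function per FSM for a hypothetical two-FSM model and combines their results. The model, the tuple encoding `(loc0, loc1, x)` and the function names are invented for this example; the real functions are generated CUDA code and run in separate threads.

```python
# Hypothetical model: two FSMs with locations loc0 and loc1, sharing variable x.

def f1(s):
    """Successors of s obtained by firing an enabled transition of FSM 1."""
    loc0, loc1, x = s
    out = []
    if loc0 == 0 and x < 2:          # guard x < 2, then x := x + 1
        out.append((1, loc1, x + 1))
    if loc0 == 1:                    # unguarded transition back to location 0
        out.append((0, loc1, x))
    return out

def f2(s):
    """Successors of s obtained by firing an enabled transition of FSM 2."""
    loc0, loc1, x = s
    out = []
    if loc1 == 0 and x > 0:          # guard x > 0, then x := x - 1
        out.append((loc0, 1, x - 1))
    if loc1 == 1:
        out.append((loc0, 0, x))
    return out

SUCCESSOR_FUNCTIONS = [f1, f2]

def successors(s):
    # On the GPU, thread t_i executes f_i; this loop stands in for the n threads.
    result = []
    for f in SUCCESSOR_FUNCTIONS:
        result.extend(f(s))
    return result

initial = (0, 0, 0)
first_level = successors(initial)
second_level = successors(first_level[0])
```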


Fig. 1 presents how the different components of the state space exploration 
engine map onto a GPU. We explain how the engine works insofar as is needed. 
For more details, we refer the reader to [50, 51, 53]. Even though the type of 
input model has changed, as GPUEXPLORE only supports models without data 
variables, the core of the engine has remained the same. 


In the global memory, a large hash table (we call it G) is maintained to store 
the states visited so far. At the start, the initial state of the input model is stored 
in G. Each state in G has a Boolean flag new, indicating whether the state has 
already been explored, i.e., whether or not its successors have been constructed. 


On the right in Fig. 1, the state space exploration algorithm is explained from 
the perspective of a thread block. While the block can find unexplored states in 
G, it selects some of those for exploration. In fact, every block has a work tile 
residing in its shared memory, of a fixed size, which the block tries to fill with 
unexplored states at the start of each exploration iteration. Such an iteration is 
initiated on the host side by launching the exploration kernel. States are marked 
as explored when added by threads to their tile. 


Next, every block processes its tile. For this, each thread in the block is 
assigned to a particular state/FSM combination. Each thread accesses its desig- 
nated state in the tile, and analyses the possibilities for its designated FSM to 
change state, as explained before. Hence, the threads in a group can generate 
successors for a single state in parallel. 


The generated successors are stored in a block-local state cache, which is a 
hash table in the shared memory. This avoids repeated accessing of global mem- 
ory, and local duplicate detection filters out any duplicate successors generated 
at the block-level. Once the tile has been processed, the threads in the block 
together scan the cache once more, and store the new states in G if they are 
not already present. When states require no more than 32 or 64 bits in to- 
tal (including the new flag), they can simply be stored atomically in G using 
compare-and-swap. However, sufficiently large systems have states consisting of 
more than 64 bits. In this paper, we therefore focus on working with these larger 
states, and consider storing them as binary trees. 
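The exploration iteration described in this section (fill the tile, generate successors into a local cache, then synchronise the cache with G) can be mimicked sequentially as follows. This is only a sketch under simplifying assumptions: a single block, a Python dictionary standing in for the global hash table G with its new flags, and a toy successor function that is not an SLCO model.

```python
def explore(init, successors, tile_size=2):
    """Sequential stand-in for the exploration loop of a single thread block."""
    G = {init: True}                      # state -> 'new' flag (True = unexplored)
    iterations = 0
    while True:
        # Fill the work tile with unexplored states and mark them as explored.
        tile = [s for s, new in G.items() if new][:tile_size]
        if not tile:
            return G, iterations
        for s in tile:
            G[s] = False
        # Process the tile; the set plays the role of the block-local state
        # cache and filters out duplicate successors locally.
        cache = set()
        for s in tile:
            cache.update(successors(s))
        # Scan the cache and store the states that are not yet present in G.
        for t in cache:
            if t not in G:
                G[t] = True
        iterations += 1

def toy_successors(s):
    # Hypothetical toy system on the integers modulo 5.
    return [(s + 1) % 5, (s * 2) % 5]

G, iterations = explore(0, toy_successors)
```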
Fig. 2: An example of storing state vectors as binary trees. 


5 A Compact GPU Tree Database 


5.1 CPU Tree Storage 


The number of data variables in a model, and their types, can have a drastic effect 
on the size of the states of that model. For instance, each 32-bit integer variable in 
a model requires 32 bits in each state. As the amount of global memory on a GPU 
is limited, we need to consider techniques to store states in a memory-efficient 
way. One technique that has proven itself for CPU-based model checkers is tree 
compression [7], in which system states are stored as binary trees. A single hash 
table can be used to store all tree nodes [27]. Compression is achieved by having 
the trees share common subtrees. Its success relies on the observation that states 
and their successors tend to be different in only a few data elements. In [27], 
it is experimentally assessed that tree compression compresses better than any 
other compression technique identified by the authors for explicit state space 
exploration. They observe that the technique works well for a multi-threaded 
exploration engine. Moreover, they propose an incremental variant that has a 
considerably improved runtime performance, as it reduces the number of required 
memory accesses to a number logarithmic in the length of the state vector. 

Fig. 2 shows an example of applying tree compression to store four state 
vectors. The black circles should be ignored for now. Each letter represents a 
part of the state vector that is k bits in length. We assume that in k bits, also 
a pointer to a node can be stored, and that each node therefore consists of 2k 
bits. The vector <A,B,C,D,E> is stored by having a root node with a left leaf 
sibling <A,B>, and the right sibling being a non-leaf that has both a left leaf 
sibling <C,D>, and the element E. In total, storing this tree requires 8k bits. To 
store the vector <A’,B,C’,D,E>, we cannot reuse any of these nodes, as <A’,B> 
and <C’,D> have not been stored yet. This means that all pointers have to be 
updated as well, and therefore, a new root and a new non-leaf containing E are 
needed. Again, 8k bits are needed. For <A,B’,C,D,E’>, we have to store a new 
node <A,B’> and a new root, and a new non-leaf storing E’, but the latter can 
point to the already existing node <C,D>. Hence, only 6k bits are needed to 
store this vector. Finally, for <A’,B,C,D,E’>, we only need to store a new root 
node, as all other nodes already exist, resulting in only needing 2k bits. It has 
been demonstrated that as more and more state vectors are stored, eventually 
new vectors tend to require 2k bits each [26, 27]. 
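The bit counts in this example can be reproduced with a small Python sketch. It hard-codes the tree shape of Fig. 2 for vectors of five elements and uses a dictionary as the single node table; the kind tags (`"leaf"`, `"inner"`, `"root"`) are an assumption added here to keep pointers and data apart, and are not part of the stored 2k-bit nodes.

```python
def store_vector(table, vec):
    """Store <v0,...,v4> with the tree shape of Fig. 2; return the number of new nodes."""
    before = len(table)

    def put(node):
        # Find-or-put: the address of a node is its insertion index in the table.
        return table.setdefault(node, len(table))

    a, b, c, d, e = vec
    left = put(("leaf", a, b))
    cd = put(("leaf", c, d))
    right = put(("inner", cd, e))        # non-leaf: pointer to <C,D> plus element E
    put(("root", left, right))
    return len(table) - before

table = {}
vectors = [
    ("A", "B", "C", "D", "E"),
    ("A'", "B", "C'", "D", "E"),
    ("A", "B'", "C", "D", "E'"),
    ("A'", "B", "C", "D", "E'"),
]
new_nodes = [store_vector(table, v) for v in vectors]
k = 32                                    # example value for the element width
bits = [2 * k * n for n in new_nodes]     # every node occupies 2k bits
```

The four vectors add 4, 4, 3 and 1 nodes respectively, i.e., 8k, 8k, 6k and 2k bits.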

To emphasise that GPU tree compression has to be implemented vastly differently 
from the typical CPU approach, we first explain the latter, and the 
incremental approach [27]. Checking for the presence of a tree and storing it if 
not yet present is typically done by means of recursion (outlined by Alg. 1). For 
now, ignore the red underlined text, i.e., the IS-UPDATED conditions. The STORE 
function returns the address of the given node in G, if present; otherwise it 
stores the node and returns its address. The FINDORPUT-CPU function first 
recursively checks whether the siblings of the node are stored, and if not, stores 
them, after which the node itself is stored. A node has pointers left and right to 
addresses of G, and there are functions to check for the existence of, and 
retrieve, the siblings of a node. 

Algorithm 1: Tree-based Find-or-put, CPU version. 

1 function FINDORPUT-CPU(node_t* G, node_t node): 
2   if HAS-LEFT-SIBLING(node) and IS-UPDATED(LEFT-SIBLING(node)) then 
3     node.left ← FINDORPUT-CPU(G, LEFT-SIBLING(node)) 
4   if HAS-RIGHT-SIBLING(node) and IS-UPDATED(RIGHT-SIBLING(node)) then 
5     node.right ← FINDORPUT-CPU(G, RIGHT-SIBLING(node)) 
6   addr ← STORE(G, node) 
7   return addr 

In the incremental approach, when creating a successor s’ of a state s, the 
tree for s, say T(s), is used as the basis for the tree T(s’). When T(s’) is created, 
each node inside it is first initialised to the corresponding node in T(s), and the 
leaves are updated for the new tree. This ‘updated’ status propagates up: when 
a non-leaf has an updated sibling, its corresponding G pointer must be updated 
when T(s’) is stored in G, but for any non-updated sibling, the non-leaf can 
keep its G pointer. When incorporating the red underlined text in Alg. 1, the 
incremental version of the function is obtained. With this version, tree storage 
often results in fewer calls to STORE, i.e., fewer memory accesses. 
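The effect of the incremental variant on the number of STORE calls can be seen in the following Python sketch of Alg. 1. The `Node` class, the `calls` list used for counting, and the dictionary standing in for G are assumptions made for this illustration; `incremental=False` corresponds to the algorithm without the IS-UPDATED conditions.

```python
class Node:
    def __init__(self, left=None, right=None, data=(), gleft=None, gright=None,
                 updated=True):
        self.left, self.right = left, right       # siblings of this node, or None
        self.data = data                          # elements stored in the node itself
        self.gleft, self.gright = gleft, gright   # G addresses of the siblings
        self.updated = updated

def find_or_put(G, node, incremental, calls):
    if node.left is not None and (not incremental or node.left.updated):
        node.gleft = find_or_put(G, node.left, incremental, calls)
    if node.right is not None and (not incremental or node.right.updated):
        node.gright = find_or_put(G, node.right, incremental, calls)
    calls.append(node.data)                       # one STORE call
    return G.setdefault((node.gleft, node.gright, node.data), len(G))

G = {}

# T(s) for s = <A,B,C,D,E>: every node is new, so everything is stored.
ab, cd = Node(data=("A", "B")), Node(data=("C", "D"))
inner = Node(left=cd, data=("E",))
root = Node(left=ab, right=inner)
first = []
find_or_put(G, root, True, first)

# T(s') for s' = <A,B',C,D,E'>: initialised from T(s), then updated.
cd2 = Node(data=("C", "D"), updated=False)                 # unchanged leaf
ab2 = Node(data=("A", "B'"))                               # updated leaf
inner2 = Node(left=cd2, data=("E'",), gleft=inner.gleft)   # keeps its G pointer
root2 = Node(left=ab2, right=inner2)
second = []
find_or_put(G, root2, True, second)
```

Storing T(s’) takes three STORE calls instead of four, because the unchanged leaf <C,D> is never looked up again.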

There are two main challenges when considering GPU incremental tree stor- 
age: 1) Recursion is detrimental to performance, as call stacks are stored in global 
memory (and with thousands of threads, a lot of memory would be needed for 
call stacks), and 2) The nodes of a tree tend to be spread all over the hash table, 
potentially leading to many random accesses. To address these, we propose a 
procedure in which threads in a block store sets of trees together in parallel. 


5.2 GPU Tree Generation 


When states are represented by trees, the tile of each thread block cannot store 
entire states, but it can store the roots of trees. To speed up successor generation, 
and avoid repeated uncoalesced global memory accessing, the trees of those roots 
are retrieved and stored in the shared memory (state cache) by the thread block. 
Once this has been done, successor generation can commence. 

Fig. 3 shows an example of the state cache evolving over time as a thread 
generates the successor s’ = <A,B’,C,D,E’> of s = <A,B,C,D,E>, with the trees 
as in Fig. 2. Each square represents a k-bit cache entry. In addition to two entries 
needed to store a node, we also use one (grey) entry to store two cache pointers 
or indices, and assume that k bits suffice to store two pointers (in practice, we 
use k = 32, which is enough, given the small size of the state cache). Hence, every 
pair of white squares followed by a grey square constitutes one cache slot. Initially 
(shown at the top of the figure), the tile has a cache pointer to the root of s, of 
which we know that it contains the G addresses a0 and a1 to refer to its siblings. 
In turn, this root points, via its cache pointers, to the locally stored copies of 
its siblings. The non-leaf one contains the global address a2. A leaf has no cache 
pointers, denoted by ‘-’. When creating s’, first, the designated thread constructs 
the leaf <A,B’>, by executing the appropriate generated CUDA function (see 
Section 4), and stores it in the cache. In Fig. 3, it is coloured black, to indicate 
that it is marked as new. Next, the thread creates a copy of <a2,E>, together 
with its cache pointers, and updates it to <a2,E’>. Finally, it creates a new root, 
with cache pointers pointing to the newly inserted nodes. This root still has 
global address gaps to be filled in (the ‘?’ marks), since it is still unknown where 
the new nodes will be stored in G. 

Fig. 3: Successor generation: deriving <A,B’,C,D,E’> from <A,B,C,D,E>. 

The reason that we store global addresses in the cache is not to access the 
nodes they point to, but to achieve incremental tree storage: in the example, as 
the global address a2 is stored in the cache, there is no need to find <C,D> in 
G when the new tree is stored; instead, we can directly construct <a2,E’>. This 
contributes to limiting the number of required global memory accesses. 

Note that there is no recursion. Given a model, the code generator determines 
the structure of all state trees, and based on this, code to fetch all the nodes of a 
tree and to construct new trees is generated. As we do not consider the dynamic 
creation and destruction of FSMs, all states have the same tree structure. 


5.3 GPU Tree Storage at Block Level 


Once a block has finished generating the successors of the states referred to by 
its tile, the state cache content must be synchronised with G. Alg. 2 presents how 
this is done. The FINDORPUT-MANY function is executed by all threads in the 
block simultaneously. It consists of an outer while-loop (l.5-28) that is executed 
as long as there is work to be done. The code uses a cooperative group called bg, 
which is created to coincide with the size of a bucket (bucketsize). When no 
buckets are used, these groups can be interpreted as consisting of only a single 
thread each. At l.4, the offset of each thread is determined, i.e., its ID inside 
its group, ranging from 0 to the size of the group minus one. 

Every thread that still has work to do (l.5) enters the for-loop of l.7-27, in 
which the content of the state cache is scanned. The parallel scanning works 
as follows: every thread first considers the node at position tid - offset of the 
cache, with tid being the thread’s block-local ID. This node is assigned to the 
thread with bg ID 0. If that index is still within the cache limits, all threads of 
bg have to move along, regardless of whether they have a node to check or not. 
At the next iteration of the for-loop, the thread jumps over BLOCK_SIZE nodes 
as long as the index is within the cache limits. 

Algorithm 2: Tree-based Find-or-put-many, at thread block level. 

 1 device function FINDORPUT-MANY(node_t* G): 
 2   node_t p, q; index_t addr; bool work_to_do ← true; bool ready; byte ballot_result 
 3   auto bg ← TILED-PARTITION<bucketsize>(THIS-THREAD-BLOCK()) 
 4   byte offset ← bg.THREAD-RANK() 
 5   while work_to_do do 
 6     work_to_do ← false 
 7     for i ← tid - offset; i < CACHE_SIZE; i ← i + BLOCK_SIZE do 
 8       ready ← false 
 9       if i + offset < CACHE_SIZE then 
10         p ← cache[i + offset] 
11         if IS-NEW-LEAF(p) then ready ← true 
12         else if IS-NEW-NONLEAF(p) then 
13           if LEFT-GAP(p) then 
14             cache[i + offset] ← SET-LEFT-GADDR(p, cache[LEFT-CADDR(p)]) 
15           if RIGHT-GAP(p) then 
16             cache[i + offset] ← SET-RIGHT-GADDR(p, cache[RIGHT-CADDR(p)]) 
17           if ¬(LEFT-OR-RIGHT-GAP(p)) then ready ← true 
18           else work_to_do ← true 
19       ballot_result ← bg.BALLOT(ready) 
20       while ballot_result do 
21         lane ← FIND-FIRST-SET(ballot_result) - 1; q ← bg.SHUFFLE(p, lane) 
22         addr ← FINDORPUT-SINGLE(bg, G, q) 
23         if offset = lane then 
24           ready ← false 
25           if addr = FULL then signal hash table full 
26           else SET-GADDR(cache[i], addr) 
27         ballot_result ← bg.BALLOT(ready) 
28     work_to_do ← bg.BALLOT(work_to_do) 

The main goal of this loop is to check which nodes are ready for synchronisation 
with G. Initially, this is the case for all nodes without global address gaps 
(see Subsection 5.2). Each thread first checks whether its own index is still within 
the cache limits (l.9). If so, the node p is retrieved from the cache at l.10. If it is 
a new leaf, ready is set to true, to indicate that the active thread is ready for 
storage (l.11). If the node is a new non-leaf (l.12), it is checked whether the node 
still has global address gaps. If it has a gap for the left sibling (l.13), this left 
sibling is inspected via the cache pointer to this sibling (retrieved with the function 
LEFT-CADDR (l.14)). The function SET-LEFT-GADDR checks whether the cache 
pointers of that sibling have been replaced by a global memory address, and if 
so, uses that address to fill the gap. The same is done for the right sibling at 
l.15-16. If, after these operations, the node p contains no gaps (l.17), ready is 
set to true. If the node still contains a gap, another loop iteration is required, 
hence work_to_do is set to true (l.18). 

At l.19, the threads in the group perform a ballot, resulting in a bit sequence 
indicating for which threads ready is true. As long as this is the case for at least 
one thread, the while-loop at l.20-27 is executed. The function FIND-FIRST-SET 
identifies the least significant bit set to 1 in ballot_result (l.21), and 
the SHUFFLE instruction results in all threads in bg retrieving the node of the 
corresponding bg thread. This node is subsequently stored by bg, by calling 
FINDORPUT-SINGLE (l.22) (explained later). Finally, the thread owning the node 
(l.23) resets its ready flag (l.24), and if the hash table is considered full, reports 
this globally (l.25). Otherwise, it records the global address of the stored node 
(l.26). After that, ballot_result is updated (l.27). Finally, once the for-loop is 
exited, the bg threads determine whether they still have more work to do (l.28). 
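The bottom-up filling of global address gaps can be simulated sequentially. The sketch below is not the parallel algorithm: ballots, shuffles and cooperative groups are replaced by two plain loops per round, and cache slots are dictionaries with hypothetical field names. It only shows why a new root needs a second round: its siblings first have to receive their addresses in G.

```python
def entry(left, right, cleft=None, cright=None, new=True, gaddr=None):
    """A cache slot: two node entries, two cache pointers, plus bookkeeping."""
    return {"left": left, "right": right, "cleft": cleft, "cright": cright,
            "new": new, "gaddr": gaddr}

def find_or_put_many(cache, G):
    """Sequential stand-in for Alg. 2; returns the number of outer-loop rounds."""
    rounds = 0
    work_to_do = True
    while work_to_do:
        work_to_do = False
        rounds += 1
        ready = []
        for i, p in enumerate(cache):                  # scan the cache (l.7-18)
            if not p["new"]:
                continue
            for side, cptr in (("left", "cleft"), ("right", "cright")):
                if p[side] is None:                    # global address gap
                    # Stays None while the sibling has no address in G yet.
                    p[side] = cache[p[cptr]]["gaddr"]
            if p["left"] is None or p["right"] is None:
                work_to_do = True
            else:
                ready.append(i)
        for i in ready:                                # store ready nodes (l.19-27)
            p = cache[i]
            p["gaddr"] = G.setdefault((p["left"], p["right"]), len(G))
            p["new"] = False
    return rounds

# G already holds the tree of <A,B,C,D,E>; the address of <C,D> is 1.
G = {("A", "B"): 0, ("C", "D"): 1, (1, "E"): 2, (0, 2): 3}
cache = [
    entry(None, None, cleft=1, cright=2),           # new root <?,?>
    entry("A", "B'"),                               # new leaf <A,B'>
    entry(1, "E'", cleft=3),                        # new non-leaf, no gap
    entry("C", "D", new=False, gaddr=1),            # cached copy of <C,D>
]
rounds = find_or_put_many(cache, G)
```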


5.4 Single Node Storage at Bucket Group Level 


In this section, we address how individual nodes are stored by a cooperative 
group bg. Before we explain the algorithm for this, Alg. 3, in detail, we consider 
our options for hashing, and propose a novel combination of existing techniques. 

In Section 2, we argued that Cuckoo hashing is very effective on a GPU. 
However, as it frequently moves elements, it is not suitable for a single hash 
table, since the non-leaves of a tree refer to the positions of other nodes. We 
address this by maintaining two hash tables, one for tree roots, and one for 
the other nodes, as done in [26]. The roots are then not referred to, and hence 
Cuckoo hashing can be applied on the root table. 

In fact, when using two hash tables, we can be even more memory-efficient. 
In [26], it was shown that Cleary tables [13, 15] can be very effective to store 
state spaces. To handle collisions in Cleary tables, order-preserving bidirectional 
linear probing [2] is used, which involves moving nodes to preserve their order. 
This makes Cleary tables, like Cuckoo hashing, not suitable to store entire trees, 
but they can be used to store the roots of the trees. In a Cleary table for roots 
of size 2k, each root r is hashed (bit scrambled) with a hash function h to a 2k 
bit sequence, from which w < k bits are taken to be used as the address to store 
r in a table with exactly 2^w buckets, and at this position, the remaining 2k - w 
bits (the remainder) are actually stored. To enable decompression, h must be 
invertible; given a remainder and an address, h^-1 can be applied to obtain r. 
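The split into address and remainder, and its inversion, can be illustrated in a few lines of Python. The bit widths and the multiplicative hash below are assumptions chosen for readability (multiplication by an odd constant is invertible modulo a power of two); they are not the hash functions of the paper, and collision handling by bidirectional linear probing is left out.

```python
BITS = 16                       # root size 2k in bits (hypothetical, kept small)
W = 6                           # number of address bits w
MASK = (1 << BITS) - 1
MULT = 0x9E37                   # any odd multiplier is invertible modulo 2**BITS
MULT_INV = pow(MULT, -1, 1 << BITS)

def h(r):
    return (r * MULT) & MASK

def h_inv(x):
    return (x * MULT_INV) & MASK

def compress(r):
    """Split h(r) into a w-bit address and a (2k - w)-bit remainder."""
    x = h(r)
    addr = x >> (BITS - W)
    rem = x & ((1 << (BITS - W)) - 1)
    return addr, rem

def decompress(addr, rem):
    """Recover r from the address at which the remainder is stored."""
    return h_inv((addr << (BITS - W)) | rem)

addr, rem = compress(12345)
recovered = decompress(addr, rem)
```

Only `rem` has to be stored in the table; the position itself carries the other w bits.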

In a multi-threaded CPU context, this approach scales well [26], but the 
parallel approach of [26,45] divides a Cleary table into regions, and sometimes, 
a region must be locked by a thread to safely reorder nodes. Unfortunately, the 
use of any form of locking, also fine-grained locking implemented with atomic 
operations, is detrimental for GPU performance. Further, the absence of coherent 
caches in GPUs means that expensive global memory accesses may be needed 
when a thread repeatedly checks the status of an acquired lock. 

Algorithm 3: Single node find-or-put, at bucket group level. 

 1 device function index_t FINDORPUT-SINGLE(tile_t bg, node_t* G, node_t p): 
 2   node_t q; index_t addr 
 3   (q, addr) ← FOP-CUCKOO-ROOT(bg, G, p) 
 4   for i ← 0; q ≠ p and i < MAX_EVICT; i ← i + 1 do 
 5     (q, addr) ← FOP-CUCKOO-ROOT(bg, G, q) 
 6   return (i = MAX_EVICT ? FULL : addr) 

 7 device function (node_t, index_t) FOP-CUCKOO-ROOT(tile_t bg, node_t* G, node_t p): 
 8   comprnode_t cp, cq; node_t q 
 9   hs ← GET-HASH-START(p); byte offset ← bg.THREAD-RANK() 
10   for i ← 0; i < NUM_HASH_FUNCTIONS; i ← i + 1 do 
11     (addr, cp) ← ADDR-COMPR-ROOT(p, h_((hs+i) mod NUM_HASH_FUNCTIONS)) 
12     (cq, pos) ← HT-FIND(bg, offset, G, addr, cp) 
13     if cq = cp then return (p, addr + pos) 
14     if cq = EMPTY then 
15       hs ← h_((hs+i) mod NUM_HASH_FUNCTIONS) 
16       break 
17   if i = NUM_HASH_FUNCTIONS then (cp, addr) ← ADDR-COMPR-ROOT(p, hs) 
18   (cq, pos) ← HT-INSERT-CUCKOO(bg, offset, G, addr, cp) 
19   if cq ≠ EMPTY and cq ≠ cp then 
20     q ← GET-DECOMPR-ROOT(cq, addr) 
21     return (q, addr + pos) 
22   return (p, addr + pos) 

As an elegant alternative, we propose Cleary-Cuckoo hashing, which combines 
Cleary compression with Cuckoo hashing. We use m hash functions that are 
invertible (as with Cuckoo hashing) and capable of scrambling the bits of a 
root to a 2k bit sequence (as in Cleary tables). When we apply a function h_i 
(0 ≤ i < m) on a root r, we get a 2k bit sequence, of which we use w bits for an 
address d, and store at d the remainder r’ consisting of 2k - w + ⌈log2(m)⌉ + 1 
bits. The ⌈log2(m)⌉ bits are needed to store the ID of the used hash function 
(i), and the final bit is needed to indicate that the root is new (unexplored). It 
is possible to retrieve r by applying h_i^-1 on d and r’ without the hash function 
ID and the new bit. When a collision occurs, the encountered root is evicted, 
decompressed, and stored again using the hash function next in line for that root. 
We refer to the application of Cleary compression to roots as root compression. 
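A sequential Python sketch of Cleary-Cuckoo hashing with buckets of size one is given below. Several things are assumptions made for illustration: the four odd multipliers used as invertible hash functions, the small bit widths, the omission of the new bit, and the full lookup under all hash functions before insertion, which is simpler than the control flow of Alg. 3.

```python
BITS, W, M = 16, 6, 4                     # root bits 2k, address bits w, hash functions m
MASK = (1 << BITS) - 1
MULTS = [0x9E37, 0x85EB, 0xC2B3, 0x27D5]  # odd, hence invertible modulo 2**BITS
INVS = [pow(a, -1, 1 << BITS) for a in MULTS]
MAX_EVICT = 64

def compress(r, i):
    """Apply h_i to r and split the result into (address, remainder)."""
    x = (r * MULTS[i]) & MASK
    return x >> (BITS - W), x & ((1 << (BITS - W)) - 1)

def decompress(addr, rem, i):
    """Apply the inverse of h_i to recover the root stored at addr."""
    return (((addr << (BITS - W)) | rem) * INVS[i]) & MASK

def find_or_put(table, r):
    """Return True if r was found or stored, False after MAX_EVICT evictions."""
    for i in range(M):                        # is r already present?
        addr, rem = compress(r, i)
        if table[addr] == (rem, i):
            return True
    cur, hs = r, 0
    for _ in range(MAX_EVICT):
        for j in range(M):                    # look for an empty candidate slot
            i = (hs + j) % M
            addr, rem = compress(cur, i)
            if table[addr] is None:
                table[addr] = (rem, i)        # remainder plus hash function ID
                return True
        addr, rem = compress(cur, hs)         # all occupied: evict under h_hs
        old_rem, old_i = table[addr]
        table[addr] = (rem, hs)
        cur = decompress(addr, old_rem, old_i)   # recover the evicted root
        hs = (old_i + 1) % M                     # retry it with its next hash function
    return False

table = [None] * (1 << W)
roots = list(range(1, 21))
results = [find_or_put(table, r) for r in roots]

# 0x8000 maps to address 32 under every odd multiplier, where root 17 resides,
# so inserting it forces an eviction; 17 is then re-stored using h_1.
evicting_insert = find_or_put(table, 0x8000)
stored = {decompress(a, e[0], e[1]) for a, e in enumerate(table) if e is not None}
```

The table stores only remainders and hash function IDs, yet every root can be reconstructed from its position, which is what makes eviction and re-insertion possible.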

Alg. 3 presents one version of the FINDORPUT-SINGLE function, to which a 
call in Alg. 2 is redirected when a root is provided. Here, G is a Cleary-Cuckoo 
table that is only used to store roots. In FINDORPUT-SINGLE, a second function 
FOP-CUCKOO-ROOT (l.7-22) is called repeatedly, as long as nodes are evicted 
or until the pre-configured MAX_EVICT has been reached, which prevents infinite 
eviction sequences (l.4). The function FOP-CUCKOO-ROOT returns the address 
where the given node was found or stored, and a node, which is either the node 
that had to be inserted or the one that was already present. 

In the FOP-CUCKOO-ROOT function, lines highlighted in purple are specific for 
root compression, i.e., Cleary compression of roots, while the green highlighted 
lines concern Cuckoo hashing, addressing node eviction. The ID of the first 
hash function to be used for node p, encoded in p itself, is stored in hs (l.9), 
and each thread determines its bg offset. Next, the thread iterates over the hash 
functions, starting with function hs (l.10-16). The G address and node remainder 
are computed at l.11. If the node is new, the remainder is marked as new. If 
root compression is not used, we have p = cp. Then, the function HT-FIND is 
called to check for the presence of the remainder in the bucket starting at addr 
(l.12). If HT-FIND returns the remainder, then it was already present (l.13), and 
this can be returned. Note that the returned address is (addr + pos), i.e., the 
offset at which the remainder can be found inside the bucket is added to addr. 
Alternatively, if EMPTY is returned, the node is not present and the bucket is not 
yet full. In this case, a bucket has been found where the node can be stored. The 
used hash function is stored in hs (l.15) and the for-loop is exited (l.16). 

At l.17, if a suitable bucket for insertion has not been found, the initial hs is 
selected again. At l.18, the function HT-INSERT-CUCKOO is called to insert cp. 
Algorithm 4: Single node insertion, at bucket group level. 

 1 device function (comprnode_t, index_t) HT-INSERT-CUCKOO(tile_t bg, byte offset, 
     node_t* G, index_t addr, comprnode_t cp): 
 2   comprnode_t cq ← G[addr + offset]; byte ballot_result ← bg.BALLOT(cq = cp) 
 3   if ballot_result then return (cp, FIND-FIRST-SET(ballot_result) - 1) 
 4   while ballot_result ← bg.BALLOT(cq = EMPTY) do 
 5     if offset = FIND-FIRST-SET(ballot_result) - 1 then 
 6       cq ← ATOMICCAS(G[addr + offset], EMPTY, cp) 
 7     cq ← bg.SHUFFLE(cq, FIND-FIRST-SET(ballot_result) - 1) 
 8     if cq = EMPTY or cq = cp then return (cq, FIND-FIRST-SET(ballot_result) - 1) 
 9     cq ← G[addr + offset] 
10   byte i ← GET-EVICTION-POS(cp) 
11   if offset = i then cq ← ATOMICEXCH(G[addr + offset], cp) 
12   cq ← bg.SHUFFLE(cq, i) 
13   return (cq, i) 


This function is presented in Alg. 4. Finally, if a value other than the original 
remainder cp or EMPTY is returned, another (remainder of a) node has been 
evicted, which is decompressed and returned at l.20-21. Otherwise, p is returned 
with its address (l.22). When Cuckoo hashing is not used, evictions do not occur, 
and at l.20-21, it is returned that the bucket is full. 

Finally, we present HT-INSERT-CUCKOO in Alg. 4. The function HT-FIND is 
not presented, but it is almost equal to l.2-3 of Alg. 4. At l.2, each thread in 
bg reads its part of the bucket G[addr + offset], and checks if it contains cp, 
the remainder of p. If it is found anywhere in the bucket, the remainder with its 
position is returned (l.3). In the while-loop at l.4-9, it is attempted to insert cp 
in an empty position. In every iteration, an empty position is selected (l.5) and 
the corresponding thread tries to atomically insert cp (l.6). At l.7, the outcome 
is shared among the threads. If it is either EMPTY or the remainder itself, it can 
be returned (l.8). Otherwise, the bucket is read again (l.9). If insertion does not 
succeed, l.10 is reached, where a hash function is used by GET-EVICTION-POS to 
hash cp to a bucket position. The corresponding thread exchanges cp with the 
node stored at that position (l.11). After the evicted node has been shared with 
the other threads (l.12), it is returned together with its position (l.13). 
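The three outcomes of Alg. 4 (already present, inserted into an empty position, or inserted by eviction) can be traced with a sequential Python sketch. Ballots, shuffles and atomic operations are replaced by ordinary loops and assignments, and the `eviction_pos` parameter is a hypothetical stand-in for GET-EVICTION-POS.

```python
EMPTY = None

def ht_insert_cuckoo(G, addr, cp, bucket_size, eviction_pos):
    """Sequential stand-in for Alg. 4; returns (value, position inside the bucket)."""
    for pos in range(bucket_size):            # l.2-3: is cp already in the bucket?
        if G[addr + pos] == cp:
            return cp, pos
    for pos in range(bucket_size):            # l.4-9: use the first empty position
        if G[addr + pos] is EMPTY:
            G[addr + pos] = cp                # an atomic compare-and-swap on the GPU
            return EMPTY, pos
    pos = eviction_pos(cp) % bucket_size      # l.10-13: the bucket is full, evict
    evicted, G[addr + pos] = G[addr + pos], cp   # an atomic exchange on the GPU
    return evicted, pos

G = [EMPTY] * 4                               # a single bucket of size 4
inserted = [ht_insert_cuckoo(G, 0, cp, 4, lambda c: c) for cp in (10, 11, 12, 13)]
found = ht_insert_cuckoo(G, 0, 11, 4, lambda c: c)
evicted = ht_insert_cuckoo(G, 0, 14, 4, lambda c: c)
```

A returned value other than EMPTY or cp itself signals an eviction, which is the case the caller in Alg. 3 handles at l.19-21.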


6 Experiments 


We implemented a code generator in PYTHON, using TEXTX [17] and JINJA2,³ 
that accepts an SLCO model and produces CUDA C++ code to explore its state 
space. The code is compiled with CUDA 11.4 targeting compute capability 7.5. 
Experiments were conducted on a machine running LINUX MINT 20 with a 
4-core INTEL CORE i7-7700 3.6 GHz, 32GB RAM, and a Titan RTX GPU. 

The goal of the experiments is to assess how fast GPU next state computation 
with the tree database is w.r.t. 1) the various options we have for hashing, 2) 
state-of-the-art CPU tools, and 3) other GPU tools. For 2), we compare with 
multi-core Depth-First Search (DFS) of SPIN 6.5.1 [22] and (explicit-state) multi- 
core Breadth-First Search (BFS) of LTSMIN 3.0.2 [24, 28]. 


3 https://palletsprojects.com/p/jinja/. 




[Figure: line plot with one curve per GPU configuration (CMP+i1, CMP+CU+i1, 
BU+i1, CMP+BU+i1, CMP+BU+CU+i1, BU+i30, CMP+i30, CMP+CU+i30, CMP+BU+i30, 
CMP+BU+CU+i30). Axes: SLCO models against millions of states per second (0 to 140).] 

Fig. 4: Speed obtained by different GPU configurations. 


In our implementation, we use 32 invertible hash functions. Root compression 
(CMP) can be turned on or off. When selected, we have a root table with 2^32 
elements, 32 bits each, and a non-root table with 2^29 elements, 64 bits each. 
This enables storing 58-bit roots (two pointers to the non-root table) in 58 - 
32 + ⌈log2(32)⌉ + 1 = 32 bits. When using buckets with more than one element 
(CMP+BU), we have root buckets of size 8, and non-root buckets of size 16. The 
non-root buckets make full use of the cache line, but the root buckets do not. 
Making the latter larger means that too many bits for root addressing are lost 
for root compression to work (the remainders will be too large). 
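The arithmetic behind these sizes can be checked with a short Python calculation. The table sizes of 2^32 root entries and 2^29 non-root entries are taken here as implied by the 32 address bits and the two 29-bit pointers of a 58-bit root, and the 128-byte cache line is an assumption about the GPU, not a figure stated in the text.

```python
from math import ceil, log2

def remainder_bits(root_bits, addr_bits, num_hash_functions):
    # 2k - w + ceil(log2(m)) + 1: remainder, hash function ID and 'new' bit.
    return root_bits - addr_bits + ceil(log2(num_hash_functions)) + 1

bits = remainder_bits(58, 32, 32)               # 58 - 32 + 5 + 1

GIB = 1 << 30
root_table_gib = (2**32 * 32 // 8) / GIB        # 2^32 entries of 32 bits each
nonroot_table_gib = (2**29 * 64 // 8) / GIB     # 2^29 entries of 64 bits each

root_bucket_bytes = 8 * 32 // 8                 # 8 root entries per bucket
nonroot_bucket_bytes = 16 * 64 // 8             # 16 non-root entries per bucket
CACHE_LINE_BYTES = 128                          # assumed cache line size
```

Under these assumptions the two tables occupy 16 GiB and 4 GiB, and only the non-root bucket fills a whole cache line.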


Root compression allows turning Cuckoo hashing on (CMP(+BU)+CU) or off 
(CMP(+BU)). When it is off, essentially Cleary-Cuckoo is still performed, except 
that evictions are not allowed, meaning that hashing fails as soon as all possible 
32 buckets for a node are occupied. 


In the configuration BU, neither root compression nor Cuckoo hashing is 
applied. We use one table with 2^30 64-bit elements and buckets of size 16. For 
reasons related to storing global addresses in the state cache, we cannot make 
the table larger. The 32 hash functions are used without allowing evictions. 


Finally, multiple iterations can be run per kernel launch. Shared memory is 
wiped when a kernel execution terminates, but the state cache content can be 
reused from one iteration to the next when a kernel executes multiple iterations, 
by which trees already in the cache do not need to be fetched again from the 
tree database. We identified 30 iterations to be effective in general (i30), and 
experimented with a single iteration per kernel launch (i1). 


With the CPU tools, we performed reachability analysis on 1- and 4-core 
configurations, denoted by Sp-1 and Sp-4 for SPIN, and LM-1 and LM-4 for 
LTSMIN. We only enabled state compression and basic reachability (without 
property checking), to favour fast exploration of large state spaces. 


For benchmarks, we used models from the BEEM benchmarks [42] of con- 
current systems, translated to SLCO and PROMELA (for SPIN). We scaled some 
of them up to have larger state spaces. Those are marked in Table 1 with ‘+’. 
Timeout is set to 3600 seconds for all benchmarks. 
Table 1: Millions of states per second for various reachability tools and configurations. 
Pink cells: out of memory. Yellow cells: timeout. Green cell: best average. 
O.M.: out of memory at initialisation. SU: speedup of (CMP + i30) vs. (LM-1). 
Columns Sp-1 to LM-4: CPU tools. Columns BITS to CMP+CU+i30: GPUEXPLORE+SLCO configurations. 

Model | States | Sp-1 | Sp-4 | LM-1 | LM-4 | BITS | CR | BU+i1 | CMP+i1 | CMP+BU+i1 | CMP+CU+i1 | CMP+i30 | CMP+CU+i30 | SU 
adding.20+ | 84,709,120 | 1.128 | 3.223 | 1.211 | 3.938 | 100 | 1.96 | 49.597 | 56.793 | 48.879 | 36.934 | 74.026 | 47.604 | 61x 
adding.50+ | 529,767,730 | 0.856 | O.M. | 1.354 | 5.356 | 100 | 1.96 | 48.403 | 103.872 | 77.243 | 49.625 | 131.444 | 57.968 | 97x 
anderson.6 | 18,206,917 | 0.623 | 1.362 | 0.516 | 1.309 | 122 | 1.82 | 14.814 | 16.035 | 13.647 | 11.265 | 34.111 | 17.649 | 62x 
anderson.7 | 538,699,029 | 0.599 | O.M. | 0.448 | 1.583 | 141 | 2.75 | 9.309 | 21.192 | 14.244 | 10.426 | 22.326 | 10.435 | 41x 
at.5 | 31,999,440 | 0.646 | 1.495 | 0.653 | 1.880 | 85 | 1.86 | 19.894 | 29.158 | 23.633 | 18.204 | 38.457 | 21.375 | 59x 
at.6 | 160,589,600 | 0.454 | 0.869 | 0.695 | 2.387 | 85 | 1.90 | 17.901 | 38.275 | 27.275 | 19.498 | 38.418 | 20.359 | 55x 
at.7 | 819,243,816 | 0.527 | O.M. | 0.666 | 2.372 | 97 | 1.98 | 12.415 | 23.629 | 17.381 | 13.194 | 22.329 | 13.378 | 34x 
at.8+ | 3,739,953,204 | 0.534 | O.M. | 0.555 | 1.817 | 97 | 1.97 | 5.452 | 7.246 | 7.593 | 11.698 | 7.287 | 11.854 | 13x 
bakery.5 | 7,866,401 | 1.400 | 2.570 | 0.410 | 0.904 | 140 | 2.51 | 11.504 | 7.838 | 7.585 | 6.407 | 19.362 | 12.782 | 47x 
bakery.7 | 29,047,471 | 1.228 | 2.592 | 0.580 | 1.618 | 140 | 2.49 | 13.236 | 9.361 | 9.021 | 7.698 | 29.783 | 17.456 | 51x 
bakery.8 | 841,696,300 | 0.760 | 1.269 | 0.690 | 2.436 | 140 | 2.40 | 8.745 | 29.410 | 23.957 | 17.116 | 32.778 | 18.215 | 48x 
elevator2.3 | 7,667,712 | 0.554 | 1.099 | 0.463 | 0.985 | 189 | 3.96 | 4.890 | 3.259 | 3.185 | 2.817 | 6.261 | 4.827 | 14x 
elevator2.4 | 91,226,112 | 0.263 | 0.561 | 0.623 | 1.945 | 213 | 3.97 | 3.025 | 3.746 | 2.907 | 3.087 | 3.267 | 2.703 | 5x 
elevator2.5+ | 1,016,070,144 | 0.189 | O.M. | 0.473 | 1.630 | 317 | 5.95 | 1.540 | 1.871 | 1.545 | 1.520 | 1.839 | 1.491 | 4x 
frogs.4 | 17,443,219 | 1.044 | 2.228 | 0.553 | 1.423 | 219 | 3.49 | 8.423 | 10.253 | 8.686 | 7.767 | 11.549 | 8.168 | 21x 
frogs.5 | 182,772,126 | 0.531 | 1.048 | 0.751 | 2.630 | 251 | 3.84 | 6.766 | 9.573 | 8.214 | 6.898 | 9.846 | 6.943 | 13x 
lamport.6 | 8,717,688 | 1.277 | 1.375 | 0.490 | 1.096 | 96 | 1.91 | 11.813 | 5.126 | 5.225 | 4.697 | 27.966 | 19.335 | 57x 
lamport.7 | 38,717,846 | 1.001 | 1.822 | 0.672 | 1.979 | 116 | 1.98 | 18.176 | 23.205 | 18.915 | 16.170 | 34.321 | 20.641 | 51x 
lamport.8 | 62,669,317 | 0.917 | 1.776 | 0.698 | 2.194 | 116 | 1.98 | 17.717 | 25.947 | 21.015 | 17.132 | 35.387 | 20.864 | 50x 
loyd.2 | 362,880 | 1.278 | 0.758 | 0.255 | 0.497 | 90 | 1.05 | 7.339 | 4.204 | 4.220 | 3.723 | 3.243 | 3.930 | 13x 
loyd.3 | 239,500,800 | 0.[?] | [?]M | 0.650 | 2.338 | 114 | 1.96 | 18.268 | 44.073 | 28.970 | 26.556 | 48.328 | 28.248 | 74x 
mes.5 | 60,556,519 | 0.[?] | [?].615 | 0.453 | 1.489 | 148 | 2.97 | 14.504 | 24.498 | 19.537 | 14.710 | 29.635 | 15.912 | 65x 
mes.6 | [?] | [?] | [?] | 0.181 | 0.331 | 156 | 2.75 | 6.037 | 3.003 | 3.097 | 2.751 | 3.446 | 3.131 | 19x 
peterson.5 | 131,064,750 | 0.711 | 1.617 | 0.727 | 2.435 | 140 | 2.98 | 16.034 | 31.975 | 21.394 | 17.813 | 32.331 | 16.681 | 42x 
peterson.6 | 174,495,861 | 0.852 | 0.756 | 0.720 | 2.451 | 140 | 2.98 | 15.503 | 32.725 | 22.975 | 17.198 | 34.902 | 17.030 | 45x 
peterson.7 | 142,471,098 | 0.683 | 1.496 | 0.652 | 2.269 | 175 | 2.63 | 13.077 | 25.667 | 18.603 | 13.868 | 26.183 | 13.120 | 37x 
phils.6 | 14,348,906 | 0.208 | 0.422 | 0.240 | 0.670 | 150 | 1.49 | 4.410 | 7.458 | 5.528 | 4.789 | 7.084 | 4.543 | 30x 
phils.7 | 71,934,773 | 0.179 | 0.297 | 0.246 | 0.764 | 151 | 1.49 | 3.585 | 5.702 | 4.762 | 4.064 | 5.382 | 3.885 | 22x 
phils.8 | 43,046,720 | 0.160 | 0.361 | 0.243 | 0.788 | 160 | 1.49 | 4.842 | 9.151 | 6.987 | 5.119 | 8.973 | 5.089 | 37x 
szymanski.5 | 79,518,740 | 0.665 | 1.571 | 0.535 | 1.815 | 180 | 2.91 | 11.944 | 17.803 | 14.416 | 11.653 | 18.357 | 11.674 | 33x 
Average | | 0.728 | 1.309 | 0.58 | 1.844 | n/a | n/a | 13.139 | 21.068 | 16.355 | 12.813 | 26.621 | 15.246 | 40x 


Fig. 4 compares the speeds of the different GPU configurations in millions 
of states per second, averaged over 5 runs. For each configuration, we sorted 
the data to observe the overall trend. The higher the speed the better. The 
CMP+i30 mode (without Cuckoo hashing or larger buckets) is the fastest for the 
majority of models. On the other hand, it fails to complete exploration for at.8, 
the largest state space with 3.7 billion states, due to running out of memory. If 
Cuckoo hashing is enabled with root compression, all state spaces are successfully 
explored, which confirms that higher load factors can be achieved [4]. However, 
Cuckoo hashing negatively impacts performance, which contradicts [4]. Although 
the exact cause is difficult to pinpoint, it clearly relates to our 
hashing being done in addition to the exploration tasks, while in papers on GPU 
hash tables [1,4], hashing is analysed in isolation. With the extra variables and 
operations needed for exploration, hashing should be lightweight, whereas Cuckoo 
hashing adds the handling of evictions. The more complex code is compiled to a 
less performant program, even when evictions do not occur. 
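
To make the extra work caused by evictions concrete, the following is a minimal sequential sketch of Cuckoo insertion in Python. It is not the Cleary-Cuckoo GPU implementation discussed in this paper: the class name, the two hash functions, the single-slot buckets, and the eviction bound are all assumptions chosen for illustration.

```python
from typing import List, Optional


class ToyCuckooTable:
    """Sequential toy Cuckoo hash table with two hash functions (illustrative only)."""

    def __init__(self, capacity: int, max_evictions: int = 16) -> None:
        self.slots: List[Optional[int]] = [None] * capacity
        self.max_evictions = max_evictions
        self.evictions = 0  # number of evictions performed so far

    def _hash(self, key: int, which: int) -> int:
        # Two deliberately simple, hypothetical hash functions.
        n = len(self.slots)
        return key % n if which == 0 else (key // n) % n

    def contains(self, key: int) -> bool:
        return any(self.slots[self._hash(key, w)] == key for w in (0, 1))

    def insert(self, key: int) -> bool:
        """Insert key; return False if the eviction bound is reached.

        On failure, the last displaced key is left without a slot; a real
        implementation would rehash or report that the table is full.
        """
        if self.contains(key):
            return True
        current, which = key, 0
        for _ in range(self.max_evictions):
            pos = self._hash(current, which)
            if self.slots[pos] is None:
                self.slots[pos] = current
                return True
            # Slot occupied: evict the resident key and place ours instead.
            current, self.slots[pos] = self.slots[pos], current
            self.evictions += 1
            # The evicted key must be moved to its alternative position.
            which = 1 if pos == self._hash(current, 0) else 0
        return False


table = ToyCuckooTable(capacity=4)
for k in (1, 5, 9):  # all three keys share the same first position
    table.insert(k)
print(table.slots, table.evictions)
```

Even in this small sketch, every insertion carries the loop, the swap, and the bookkeeping for the alternative position, regardless of whether an eviction is actually triggered.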


Table 1 compares GPU performance with that of SPIN and LTSMIN. We refer to 
our tool as GPUEXPLORE+SLCO. From the results of Fig. 4, we selected a 
set of configurations demonstrating the impact of the various options. For each 
model, BITS and CR give the state vector length in bits and the compression 
ratio, defined as (number of roots × number of leaves per tree) / (number of 
nodes). With the compression ratio, we measure how effective the node sharing 
is, compared to storing each state individually without sharing. In 
Table 2: Millions of states per second for various GPU tools. 


Tool                          anderson.6  anderson.7  lamport.8  peterson.5  peterson.6  peterson.7  szymanski.5
GRAPPLE                            2.138      14.299        n/a      10.941       9.074       8.967          n/a
GPUEXPLORE 2.0                    15.863       8.737     33.063      16.874      16.705      13.581       26.454
GPUEXPLORE+SLCO (CMP+i30)         34.111      22.326     35.387      32.331      34.902      26.183       18.357


addition, the speed in millions of states per second is given. Regarding runs that 
ran out of memory, we are aware that SPIN has other, slower compression options, but we 
only considered the fastest, to favour the CPU speeds. Times are restricted to 
exploration; code generation and compilation always take a few seconds. The 
best GPU results are highlighted in bold. To compute the speedup (SU), the 
result of CMP+i30, the overall best configuration, has been divided by the LM-1 
result (the single-core configuration that completely explored all state spaces 
except one). All GPU experiments have been done with 512 threads per block, 
and 3,240 blocks (45 blocks per SM). We identified this configuration as being 
effective for anderson.6, and used it for all models. 
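
The compression ratio defined above can be illustrated with a toy tree database in Python. This is only a sketch under stated assumptions: the class `ToyTreeDatabase`, the pairing of neighbouring nodes into a binary tree, and the choice to count leaves as stored nodes are hypothetical simplifications, not the GPU data structure of this paper. States are assumed to be non-empty.

```python
from typing import Dict, Hashable, List, Sequence, Set


class ToyTreeDatabase:
    """Stores each state as a binary tree; identical nodes are stored once."""

    def __init__(self) -> None:
        self._ids: Dict[Hashable, int] = {}  # node content -> node id
        self.roots: Set[int] = set()         # ids of the roots (one per state)

    def _intern(self, node: Hashable) -> int:
        # Return the id of an existing identical node, or assign a fresh id.
        return self._ids.setdefault(node, len(self._ids))

    def insert(self, leaves: Sequence[Hashable]) -> bool:
        """Insert a state given as its leaves; return True if the state is new."""
        level: List[int] = [self._intern(("leaf", x)) for x in leaves]
        while len(level) > 1:
            # Pair up neighbouring nodes to build the next tree level.
            level = [
                self._intern(("node", tuple(level[i:i + 2])))
                for i in range(0, len(level), 2)
            ]
        root = level[0]
        is_new = root not in self.roots
        self.roots.add(root)
        return is_new

    def node_count(self) -> int:
        return len(self._ids)


def compression_ratio(num_roots: int, leaves_per_tree: int, num_nodes: int) -> float:
    """(number of roots x number of leaves per tree) / (number of nodes)."""
    return (num_roots * leaves_per_tree) / num_nodes


db = ToyTreeDatabase()
states = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0), (0, 0, 1, 1)]
for state in states:
    db.insert(state)
cr = compression_ratio(len(db.roots), 4, db.node_count())
print(db.node_count(), cr)
```

Because the four example states share their left subtree and their leaves, fewer nodes are stored than the 16 leaves that separate storage would need, which is what a compression ratio above 1 expresses.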

While LTSMIN tends to achieve near-linear speed-ups (compare LM-1 and 
LM-4), the speed of GPUEXPLORE+SLCO heavily depends on the model. For 
some models, as the state spaces of instances become larger, the speed increases, 
and for others, it decreases. The exact cause for this is hard to identify, and we 
plan to work on further optimisations. For instance, the branching factor, i.e., 
average number of successors of a state, plays a role here, as large branching 
factors favour parallel computation (many threads will become active quickly). 
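
As an illustration of the branching factor, the following Python sketch explores a toy state space breadth-first and reports the average number of successors per state. The function `explore` and the toy model of two counters modulo 3 are assumptions made for this example; they are unrelated to the benchmark models.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, Tuple


def explore(
    initial: Hashable,
    successors: Callable[[Hashable], Iterable[Hashable]],
) -> Tuple[int, float]:
    """Breadth-first exploration; returns (number of states, branching factor)."""
    seen = {initial}
    frontier = deque([initial])
    transitions = 0
    while frontier:
        state = frontier.popleft()
        for target in successors(state):
            transitions += 1  # count every outgoing transition
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    # Branching factor: average number of successors of a reachable state.
    return len(seen), transitions / len(seen)


def two_counters(state: Tuple[int, int]):
    """Toy model: either of two counters modulo 3 may be incremented."""
    a, b = state
    return [((a + 1) % 3, b), (a, (b + 1) % 3)]


num_states, branching = explore((0, 0), two_counters)
print(num_states, branching)
```

With a larger branching factor, a single state yields more successor-generation tasks, so more threads can be kept busy early in the exploration.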

Our overall fastest configuration does not use larger buckets, nor Cuckoo 
hashing. Regarding buckets, as already noted in Section 3, starting with the 
TURING architecture, NVIDIA GPUs are less sensitive to uncoalesced accesses, 
and our results confirm that. Performing fewer tasks in parallel seems to be more 
harmful for performance than a larger number of uncoalesced accesses. 

Finally, Table 2 compares GPUEXPLORE+SLCO with GPUEXPLORE 2.0 and 
GRAPPLE. A comparison with PARAMOC was not possible, as it targets very 
different types of (sequential) models. The models we selected are those available 
for at least two of the tools we considered. Unfortunately, GRAPPLE does not 
(yet) support reading PROMELA models. Instead, a number of models are 
encoded directly into its source code, and we were limited to checking only those 
models. It can be observed that in the majority of cases, our tool achieves the 
highest speeds. This is surprising, as the trees we use tend to lead to more global 
memory accesses, but it also encourages us to pursue this direction further. 


7 Conclusions and Future Work 


We discussed new algorithms to achieve a GPU tree database, which enables 
memory-efficient explicit state space exploration for FSMs with data. We 
proposed Cleary-Cuckoo hashing, which makes it possible to use, for the first time, 
Cleary compression on GPUs. Experiments show processing speeds of up to 131 
million trees per second. In the last decade, new GPUs have been increasingly 
effective for state space exploration [10], and in the future, they are expected to 
be more capable of handling thread divergence, which still heavily occurs when 
accessing G. Therefore, we are optimistic about further improvements. As 
future work, we will focus on optimisations and verifying temporal logic formulae. 


Data Availability Statement. The datasets generated and analysed during 
the current study are available in the Zenodo repository [39]. 